**Sriram Sankaranarayanan · Natasha Sharygina (Eds.)**

# **Tools and Algorithms for the Construction and Analysis of Systems**

**29th International Conference, TACAS 2023 Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2023 Paris, France, April 22–27, 2023 Proceedings, Part I**

## Lecture Notes in Computer Science 13993

Founding Editors

Gerhard Goos, Germany · Juris Hartmanis, USA

## Editorial Board Members

Elisa Bertino, USA · Wen Gao, China · Bernhard Steffen, Germany · Moti Yung, USA

## Advanced Research in Computing and Software Science Subline of Lecture Notes in Computer Science

Subline Series Editors

Giorgio Ausiello, University of Rome 'La Sapienza', Italy · Vladimiro Sassone, University of Southampton, UK

Subline Advisory Board

Susanne Albers, TU Munich, Germany · Benjamin C. Pierce, University of Pennsylvania, USA · Bernhard Steffen, University of Dortmund, Germany · Deng Xiaotie, Peking University, Beijing, China · Jeannette M. Wing, Microsoft Research, Redmond, WA, USA

More information about this series at https://link.springer.com/bookseries/558

Sriram Sankaranarayanan • Natasha Sharygina Editors

# Tools and Algorithms for the Construction and Analysis of Systems

29th International Conference, TACAS 2023 Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2023 Paris, France, April 22–27, 2023 Proceedings, Part I

Editors Sriram Sankaranarayanan University of Colorado Boulder, CO, USA

Natasha Sharygina University of Lugano Lugano, Switzerland

ISSN 0302-9743 · ISSN 1611-3349 (electronic)

Lecture Notes in Computer Science

ISBN 978-3-031-30822-2 · ISBN 978-3-031-30823-9 (eBook)

https://doi.org/10.1007/978-3-031-30823-9

© The Editor(s) (if applicable) and The Author(s) 2023. This book is an open access publication.

Open Access This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

## ETAPS Foreword

Welcome to the 26th ETAPS! ETAPS 2023 took place in Paris, the beautiful capital of France. ETAPS 2023 was the 26th instance of the European Joint Conferences on Theory and Practice of Software. ETAPS is an annual federated conference established in 1998, and consists of four conferences: ESOP, FASE, FoSSaCS, and TACAS. Each conference has its own Program Committee (PC) and its own Steering Committee (SC). The conferences cover various aspects of software systems, ranging from theoretical computer science to foundations of programming languages, analysis tools, and formal approaches to software engineering. Organising these conferences in a coherent, highly synchronized conference programme enables researchers to participate in an exciting event, having the possibility to meet many colleagues working in different directions in the field, and to easily attend talks of different conferences. On the weekend before the main conference, numerous satellite workshops took place that attracted many researchers from all over the globe.

ETAPS 2023 received 361 submissions in total, 124 of which were accepted, yielding an overall acceptance rate of 34.3%. I thank all the authors for their interest in ETAPS, all the reviewers for their reviewing efforts, the PC members for their contributions, and in particular the PC (co-)chairs for their hard work in running this entire intensive process. Last but not least, my congratulations to all authors of the accepted papers!

ETAPS 2023 featured the unifying invited speakers Véronique Cortier (CNRS, LORIA laboratory, France) and Thomas A. Henzinger (Institute of Science and Technology, Austria) and the conference-specific invited speakers Mooly Sagiv (Tel Aviv University, Israel) for ESOP and Sven Apel (Saarland University, Germany) for FASE. Invited tutorials were provided by Ana-Lucia Varbanescu (University of Twente and University of Amsterdam, The Netherlands) on heterogeneous computing and Joost-Pieter Katoen (RWTH Aachen, Germany and University of Twente, The Netherlands) on probabilistic programming.

As part of the programme we had the second edition of TOOLympics, an event to celebrate the achievements of the various competitions or comparative evaluations in the field of ETAPS.

ETAPS 2023 was organized jointly by Sorbonne Université and Université Sorbonne Paris Nord. Sorbonne Université (SU) is a multidisciplinary, research-intensive and world-class academic institution. It was created in 2018 as the merger of two first-class research-intensive universities, UPMC (Université Pierre et Marie Curie) and Paris-Sorbonne. SU has three faculties: humanities, medicine, and science and engineering. It has 55,600 students (4,700 PhD students; 10,200 international students), 6,400 teachers and professor-researchers, and 3,600 administrative and technical staff members. Université Sorbonne Paris Nord is one of the thirteen universities that succeeded the University of Paris in 1968. It is a major teaching and research center located in the north of Paris. It has five campuses, spread over the two departments of Seine-Saint-Denis and Val d'Oise: Villetaneuse, Bobigny, Saint-Denis, the Plaine Saint-Denis and Argenteuil. The university has more than 25,000 students in different fields, such as health, medicine, languages, humanities, and science. The local organization team consisted of Fabrice Kordon (general co-chair), Laure Petrucci (general co-chair), Benedikt Bollig (workshops), Stefan Haar (workshops), Étienne André (proceedings and tutorials), Céline Ghibaudo (sponsoring), Denis Poitrenaud (web), Stefan Schwoon (web), Benoît Barbot (publicity), Nathalie Sznajder (publicity), Anne-Marie Reytier (communication), Hélène Pétridis (finance) and Véronique Criart (finance).

ETAPS 2023 is further supported by the following associations and societies: ETAPS e.V., EATCS (European Association for Theoretical Computer Science), EAPLS (European Association for Programming Languages and Systems), EASST (European Association of Software Science and Technology), Lip6 (Laboratoire d'Informatique de Paris 6), LIPN (Laboratoire d'informatique de Paris Nord), Sorbonne Université, Université Sorbonne Paris Nord, CNRS (Centre national de la recherche scientifique), CEA (Commissariat à l'énergie atomique et aux énergies alternatives), LMF (Laboratoire méthodes formelles), and Inria (Institut national de recherche en informatique et en automatique).

The ETAPS Steering Committee consists of an Executive Board, and representatives of the individual ETAPS conferences, as well as representatives of EATCS, EAPLS, and EASST. The Executive Board consists of Holger Hermanns (Saarbrücken), Marieke Huisman (Twente, chair), Jan Kofroň (Prague), Barbara König (Duisburg), Thomas Noll (Aachen), Caterina Urban (Inria), Jan Křetínský (Munich), and Lenore Zuck (Chicago).

Other members of the steering committee are: Dirk Beyer (Munich), Luís Caires (Lisboa), Ana Cavalcanti (York), Bernd Finkbeiner (Saarland), Reiko Heckel (Leicester), Joost-Pieter Katoen (Aachen and Twente), Naoki Kobayashi (Tokyo), Fabrice Kordon (Paris), Laura Kovács (Vienna), Orna Kupferman (Jerusalem), Leen Lambers (Cottbus), Tiziana Margaria (Limerick), Andrzej Murawski (Oxford), Laure Petrucci (Paris), Elizabeth Polgreen (Edinburgh), Peter Ryan (Luxembourg), Sriram Sankaranarayanan (Boulder), Don Sannella (Edinburgh), Natasha Sharygina (Lugano), Pawel Sobocinski (Tallinn), Sebastián Uchitel (London and Buenos Aires), Andrzej Wasowski (Copenhagen), Stephanie Weirich (Pennsylvania), Thomas Wies (New York), Anton Wijs (Eindhoven), and James Worrell (Oxford).

I would like to take this opportunity to thank all authors, keynote speakers, attendees, organizers of the satellite workshops, and Springer-Verlag GmbH for their support. I hope you all enjoyed ETAPS 2023.

Finally, a big thanks to Laure and Fabrice and their local organization team for all their enormous efforts to make ETAPS a fantastic event.

April 2023 Marieke Huisman ETAPS SC Chair ETAPS e.V. President

## Preface

We are pleased to present the proceedings of TACAS 2023, the 29th edition of the International Conference on Tools and Algorithms for the Construction and Analysis of Systems, held as part of the 26th European Joint Conferences on Theory and Practice of Software (ETAPS 2023), April 22–27, 2023 in Paris, France. TACAS brings together a community of researchers, developers, and end-users who are broadly interested in rigorous algorithmic techniques for the construction and analysis of systems. The conference interleaves various disciplines, including formal verification of software and hardware systems, static analysis, program synthesis, verification of machine learning/autonomous systems, probabilistic programming, SAT/SMT solving, constraint solving, automated theorem proving, and cyber-physical systems.

There were five submission categories for TACAS 2023:


Regular research, case study, and regular tool papers were restricted to a total of sixteen pages, and tool demonstration papers to six pages, exclusive of references.

This year 169 papers were submitted to TACAS, consisting of 119 regular research papers, 34 regular tool and case study papers, and 16 tool demonstration papers. Each paper was reviewed by three Program Committee (PC) members, who made use of subreviewers. As a result, the PC accepted in total 62 papers, among which there were 45 regular papers, 11 regular tool/case-study papers and 6 tool demonstration papers. The PC members were pleasantly surprised by an unusually large number of strong submissions. Almost all accepted papers had either all positive reviews or a "championing" program committee member who argued in favor of accepting the paper. Furthermore, all accepted papers had a positive average score. One paper was accepted conditionally and successfully "shepherded" by the PC.

Similarly to previous years, it was possible to submit an artifact alongside a paper, which was mandatory for regular tool and tool demonstration papers. An artifact might consist of tools, models, proofs, or other data required for validation of the results of the paper. The Artifact Evaluation Committee (AEC) reviewed the artifacts based on their documentation, ease of use, and, most importantly, whether the results presented in the corresponding paper could be accurately reproduced. The evaluation was carried out using a standardized virtual machine to ensure consistency of the results, except for 4 artifacts that had special hardware or software requirements. The evaluation had two rounds. The first round was carried out in parallel with the work of the PC and evaluated the artifacts for all the submitted regular tool and tool demo papers. The judgment of the AEC was communicated to the PC and weighed in their discussion (the PC rejected a total of 4 papers in this phase). The second round took place after the paper acceptance notifications were sent out so the authors of accepted research and case-study papers could submit their artifacts. In both rounds, the AEC provided 3 reviews per artifact and communicated with the authors to resolve apparent technical issues. In total, 69 artifacts were submitted (51 in the first round and 18 in the second), and the AEC evaluated a total of 64 artifacts regarding their availability, functionality, and/or reusability. Finally, among the 62 accepted papers, the AEC awarded 32 functional badges, 21 reusable badges, and 33 available badges. Such badges appear on the first page of each paper to certify the properties of each artifact.

As a separate conference track, TACAS 2023 hosted the 12th Competition on Software Verification (SV-COMP 2023). SV-COMP is the annual comparative evaluation of tools for automatic software verification and witness validation. The TACAS proceedings contain a selection of 13 short papers that describe participating verification systems and a report presenting the results of the competition. These papers were reviewed by a separate program committee (the competition jury); each of the papers was assessed by at least three reviewers. A total of 52 verification systems were systematically evaluated, with 34 developer teams from ten countries, including five submissions from industry. Two sessions in the TACAS program were reserved for the competition: presentations by the competition chair and the participating development teams in the first session and an open community meeting in the second session.

We would like to thank all the people who helped to make TACAS 2023 successful. First, we would like to thank the authors for submitting their papers to TACAS 2023. The PC members and additional reviewers did a great job in reviewing papers: they contributed informed and detailed reports and engaged in the PC discussions. We also thank the steering committee, and especially its chair, Joost-Pieter Katoen, for his valuable advice. Lastly, we would like to thank the overall organization team of ETAPS 2023.

April 2023 Sriram Sankaranarayanan Natasha Sharygina Grigory Fedyukovich Sergio Mover Dirk Beyer

## Organization

## Program Committee Chairs

- Sriram Sankaranarayanan, University of Colorado Boulder, USA
- Natasha Sharygina, University of Lugano, Switzerland
## Program Committee

- Ezio Bartocci, TU Wien, Austria
- Armin Biere, University of Freiburg, Germany
- Nikolaj Bjørner, Microsoft, USA
- Chuchu Fan, MIT, USA
- Khalil Ghorbal, Inria, France
- Laura Kovacs, TU Wien, Austria
- Christel Baier, TU Dresden, Germany
- Haniel Barbosa, Universidade Federal de Minas Gerais, Brazil
- Dirk Beyer, LMU Munich, Germany
- Roderick Bloem, Graz University of Technology, Austria
- Ahmed Bouajjani, IRIF, Université Paris Cité, France
- Hana Chockler, King's College London, UK
- Alessandro Cimatti, Fondazione Bruno Kessler, Italy
- Rance Cleaveland, University of Maryland, USA
- Javier Esparza, TU Munich, Germany
- Grigory Fedyukovich, Florida State University, USA
- Bernd Finkbeiner, CISPA Helmholtz Center for Information Security, Germany
- Martin Fränzle, Carl von Ossietzky Universität Oldenburg, Germany
- Laure Gonnord, Grenoble-INP/LCIS, France
- Orna Grumberg, Technion - Israel Institute of Technology, Israel
- Kim Guldstrand Larsen, Aalborg University, Denmark
- Arie Gurfinkel, University of Waterloo, Canada
- Ranjit Jhala, University of California, San Diego, USA
- Alexander Kulikov, St. Petersburg Department of Steklov Institute of Mathematics, Russia
- Bettina Könighofer, Graz University of Technology, Austria
- Wenchao Li, Boston University, USA
- Sergio Mover, Ecole Polytechnique, France
- Peter Müller, ETH Zurich, Switzerland
- Kedar Namjoshi, Nokia Bell Labs, USA
- Aina Niemetz, Stanford University, USA
- Corina Pasareanu, CMU, NASA, KBR, USA
- Nir Piterman, University of Gothenburg, Sweden


## Artifact Evaluation Committee Chairs


## Artifact Evaluation Committee



## Program Committee and Jury—SV-COMP

- Dirk Beyer (Chair), LMU Munich, Germany
- Viktor Malík (2LS), TU Brno, Czechia
- Lei Bu (BRICK), Nanjing University, China
- Marek Chalupa (Bubaak), ISTA, Austria
- Michael Tautschnig (CBMC), Queen Mary University London, UK
- Henrik Wachowitz (CPAchecker), LMU Munich, Germany
- Hernán Ponce de León (Dartagnan), Huawei Dresden Research, Germany
- Fei He (Deagle), Tsinghua University, China
- Fatimah Aljaafari (EBF), University of Manchester, UK
- Rafael Sá Menezes (ESBMC-kind), University of Manchester, UK
- Martin Spiessl (Frama-C-SV), LMU Munich, Germany
- Falk Howar (GDart, GDart-LLVM), TU Dortmund, Germany
- Simmo Saan (Goblint), University of Tartu, Estonia
- William Leeson (Graves-CPA, Graves-Par), University of Virginia, USA
- Soha Hussein (Java-Ranger), University of Minnesota, USA
- Peter Schrammel (JBMC), University of Sussex/Diffblue, UK
- Gidon Ernst (Korn), LMU Munich, Germany
- Tong Wu (LF-checker), University of Manchester, UK
- Vesal Vojdani (Locksmith), University of Tartu, Estonia
- Lei Bu (MLB), Nanjing University, China
- Raphaël Monat (Mopsa), Inria and University of Lille, France
- Cedric Richter (PeSCo-CPA), University of Oldenburg, Germany
- Jie Su (PIChecker), Xidian University, China
- Marek Trtik (Symbiotic), Masaryk University, Brno, Czechia
- Levente Bajczi (Theta), Budapest University of Technology and Economics, Hungary


## Steering Committee

- Dirk Beyer, LMU Munich, Germany
- Rance Cleaveland, University of Maryland, USA
- Holger Hermanns, Universität des Saarlandes, Germany
- Joost-Pieter Katoen (Chair), RWTH Aachen, Germany, and Universiteit Twente, Netherlands
- Kim G. Larsen, Aalborg University, Denmark
- Bernhard Steffen, Technische Universität Dortmund, Germany

## Additional Reviewers

Abd Alrahman, Yehia Ahmad, H. M. Sabbir An, Jie Asarin, Eugene Azzopardi, Shaun Bacci, Giorgio Baier, Daniel Balakrishnan, Gogul Balasubramanian, A. R. Baumeister, Jan Becchi, Anna Ben Shimon, Yoav Berger, Guillaume Beutner, Raven Bily, Aurel Blicha, Martin Bombardelli, Alberto Brieger, Marvin Brizzio, Matías Bunk, Thomas Caillaud, Benoît Cano Córdoba, Filip

Ceresa, Martin Ceska, Milan Chen, Mingshuai Chen, Xin Chen, Yilei Chiari, Michele Czerner, Philipp Dardinier, Thibault Dawson, Charles De Masellis, Riccardo Debrestian, Darin Di Stefano, Luca Egolf, Derek Elad, Neta Elashkin, Andrey Esen, Zafer Fazekas, Katalin Feng, Shenghua Ferres, Bruno Fiedor, Jan Fleury, Mathias Fontaine, Pascal

Frenkel, Eden Frenkel, Hadar Froleyks, Nils Fu, Feisi Garcia-Contreras, Isabel Garg, Kunal Georgiou, Pamina Gianola, Alessandro Gigerl, Barbara Goorden, Martijn Gorostiaga, Felipe Goyal, Srajan Griggio, Alberto Grosen, Thomas Møller Gstrein, Bernhard Gupta, Ashutosh Habermehl, Peter Hader, Thomas Hadzic, Vedad Hagemann, Willem Hamza, Ameer Haring, Johannes Hausmann, Daniel Havlena, Vojtěch Hermo, Montserrat Holík, Lukáš Hozzová, Petra Huang, Chao Huang, Chengchao Hyvärinen, Antti Itzhaky, Shachar Jacobs, Swen Jaeger, Manfred Jansen, David N. Jensen, Nicolaj Østerby Jha, Prabhat Jonas, Martin Junges, Sebastian Kaki, Gowtham Kaufmann, Daniela Kenison, George Kettl, Matthias Khalimov, Ayrat Kifetew, Fitsum Kiourti, Panagiota Klüppelholz, Sascha

Kröger, Paul Käfer, Nikolai Lal, Akash Larrauri, Alberto Larraz, Daniel Lazic, Marijana Le, Nham Lee, Nian-Ze Lengal, Ondrej Li, Renjue Lidell, David Liu, Jiaxiang Lopez-Miguel, Ignacio D. Luttenberger, Michael Macías, Fernando Maderbacher, Benedikt McClurg, Jedidiah Meng, Yue Metzger, Niklas Michelland, Sebastien Monniaux, David Moosbrugger, Marcel Nadel, Alexander Nam, Seunghyeon Nesterini, Eleonora Neufeld, Emery Nickovic, Dejan Noetzli, Andres Oliveira Da Costa, Ana Otoni, Rodrigo Parthasarathy, Gaurav Paxian, Tobias Pluska, Alexander Poli, Federico Pontiggia, Francesco Prandi, Davide Pranger, Stefan Preiner, Mathias Radanne, Gabriel Rakow, Astrid Rappoport, Omer Rauh, Andreas Rawson, Michael Rebola Pardo, Adrian Reynolds, Andrew Riley, Daniel

Rodriguez, Andoni Rogalewicz, Adam Román Calvo, Enrique Rubio, Rubén Rutledge, Kwesi Sallinger, Sarah Sankaranarayanan, Sriram Schlichtkrull, Anders Schoisswohl, Johannes Schultz, William Schupp, Stefan Schwammberger, Maike Sextl, Florian Siber, Julian So, Oswin Sogokon, Andrew Spiessl, Martin Steen, Alexander Su, Yusen Susi, Angelo Síč, Juraj Tappler, Martin Thibault, Joan Ting, Gan Treml, Lilly Maria Trivedi, Ashutosh

Turrini, Andrea Varanasi, Sarat Chandra Vediramana Krishnan, Hari Govind Visconti, Ennio Wachowitz, Henrik Wand, Michael Wardega, Kacper Weininger, Maximilian Wendler, Philipp Wienhöft, Patrick Wu, Hao Wu, Haoze Xue, Anton Yadav, Drishti Yang, Pengfei Yang, Ruixiao Yu, Chenning Yu, Mingxin Zavalia, Lucas Zhan, Bohua Zhang, Hanwei Zhang, Songyuan Zhou, Weichao Zhou, Yuhao Zimmermann, Martin Zlatkin, Ilia

## Contents – Part I

#### Invited Talk


#### Machine Learning/Neural Networks



#### Constraint Solving/Blockchain


#### Markov Chains/Stochastic Control


#### Verification




## Contents – Part II

#### Tool Demos




#### Tools (Regular Papers)



#### Graphs/Probabilistic Systems


#### Runtime Monitoring/Program Analysis


#### 12th Competition on Software Verification — SV-COMP 2023




## Invited Talk

## A Learner-Verifier Framework for Neural Network Controllers and Certificates of Stochastic Systems<sup>∗</sup>

Krishnendu Chatterjee<sup>1</sup>, Thomas A. Henzinger<sup>1</sup>, Mathias Lechner<sup>2</sup>, and Ðorđe Žikelić<sup>1</sup>

<sup>1</sup> Institute of Science and Technology Austria (ISTA), Klosterneuburg, Austria {krishnendu.chatterjee,tah,djordje.zikelic}@ist.ac.at

<sup>2</sup> Massachusetts Institute of Technology (MIT), Cambridge, MA, USA mlechner@mit.edu

Abstract. Reinforcement learning has received much attention for learning controllers of deterministic systems. We consider a learner-verifier framework for stochastic control systems and survey recent methods that formally guarantee a conjunction of reachability and safety properties. Given a property and a lower bound on the probability of the property being satisfied, our framework jointly learns a control policy and a formal certificate to ensure the satisfaction of the property with a desired probability threshold. Both the control policy and the formal certificate are continuous functions from states to reals, which are learned as parameterized neural networks. While in the deterministic case, the certificates are invariant and barrier functions for safety, or Lyapunov and ranking functions for liveness, in the stochastic case the certificates are supermartingales. For certificate verification, we use interval arithmetic abstract interpretation to bound the expected values of neural network functions.

Keywords: Learning-based control · Stochastic systems · Martingales · Formal verification

## 1 Introduction

Learning-based control and verification of learned controllers. Learning-based control and reinforcement learning (RL) were empirically demonstrated to have enormous potential to solve highly non-linear control tasks. However, their deployment in safety-critical scenarios such as autonomous driving or healthcare requires safety assurances. Most safety-aware RL algorithms optimize expected reward while only empirically trying to maximize safety probability. This, together with the non-explainable nature of neural network controllers obtained via deep RL, raises questions about the trustworthiness of learning-based methods for safety-critical applications [9,27]. To that end, formal verification of learned

<sup>∗</sup>This work was supported in part by the ERC-2020-AdG 101020093, ERC CoG 863818 (FoRM-SMArt) and the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Grant Agreement No. 665385.

https://doi.org/10.1007/978-3-031-30823-9\_1

controllers as well as learning-based control with formal safety guarantees have become very active research topics.

Learning certificate functions. A classical approach to formally proving properties of dynamical systems is to compute a certificate function. A certificate function [26] assigns real values to system states, and its defining conditions imply satisfaction of the property. Thus, in order to prove the property of interest, it suffices to compute a certificate function for that property. For instance, Lyapunov functions [46] and barrier functions [50] are standard certificate functions for proving reachability of some target set and avoidance of some unsafe set of system states, respectively, when the system dynamics are deterministic. While both Lyapunov and barrier functions are well-studied concepts in dynamical systems theory, early methods for their computation required either designing the certificates by hand or using computationally intractable numerical procedures. A more recent approach reduces certificate computation to a semi-definite programming problem by using sum-of-squares (SOS) techniques [33,49,37]. However, this approach is applicable only to polynomial systems and to the computation of polynomial certificate functions; it does not handle systems with general non-linearities. Moreover, SOS methods do not scale well with the dimension of the system.
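As a minimal concrete instance of a certificate function (an illustrative example, not taken from the surveyed work), the following sketch checks a quadratic Lyapunov function $V(x) = x^\top P x$ for a stable linear deterministic system, where $P$ is obtained from the discrete Lyapunov equation $A^\top P A - P = -Q$:

```python
import numpy as np

# Stable linear dynamics x_{t+1} = A x_t (spectral radius < 1).
A = np.array([[0.9, 0.2],
              [0.0, 0.8]])
Q = np.eye(2)

# Solve the discrete Lyapunov equation A^T P A - P = -Q via the
# convergent series P = sum_k (A^T)^k Q A^k (valid for stable A).
P = np.zeros_like(Q)
Ak = np.eye(2)
for _ in range(500):
    P += Ak.T @ Q @ Ak
    Ak = A @ Ak

def V(x):
    """Quadratic Lyapunov candidate V(x) = x^T P x."""
    return x @ P @ x

# Certificate condition: V strictly decreases along every trajectory step,
# V(Ax) - V(x) = x^T (A^T P A - P) x = -x^T Q x.
rng = np.random.default_rng(0)
for _ in range(1000):
    x = rng.uniform(-1, 1, size=2)
    assert V(A @ x) <= V(x) - x @ Q @ x + 1e-9
print("decrease condition V(Ax) <= V(x) - x^T Q x holds on all sampled states")
```

For neural network certificates the closed-form Lyapunov equation is unavailable, which is exactly where the learning-based methods discussed next come in.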

Learning-based methods are a promising approach to overcome these limitations and they have received much attention in recent years. These methods jointly learn a neural network control policy and a neural network certificate function, e.g. a Lyapunov function [53,18,3,17] or a barrier function [38,58,52,1], depending on the property of interest. The neural network certificate is then formally verified, ensuring that these methods provide formal guarantees. Both learning and verification procedures developed for verifying neural network certificates are not restricted to polynomial dynamical systems. See [26] for an overview of existing learning-based control methods that learn a certificate function to verify a system property in deterministic dynamical systems.

Prior works – deterministic dynamical systems. While the above works present significant advancements in learning-based control and verification of dynamical systems, they are predominantly restricted to deterministic dynamical systems. In other words, they assume that they have access to the exact dynamics function according to which the system evolves. However, for most control tasks, the underlying models used by control methods are imperfect approximations of real systems inferred from observed data. Thus, control and verification methods should also account for model uncertainty due to the noise in observed data and the approximate nature of model inference.

This survey – stochastic dynamical systems. In this work, we survey recent developments in learning-based methods for control and verification of discrete-time stochastic dynamical systems, based on [44,68]. Stochastic dynamical systems use probability distributions to quantify and model uncertainty. In stochastic dynamical systems, given a property of interest and a probability parameter p ∈ [0, 1], the goal is to learn a control policy and a formal certificate which guarantees that the system under the learned policy satisfies the property of interest with probability at least p.

Supermartingale certificate functions. Lyapunov functions and barrier functions can be used to prove properties in deterministic dynamical systems; however, they are not applicable to stochastic dynamical systems and do not allow reasoning about the probability of a property being satisfied. Instead, the learning-based methods of [44,68] use supermartingale certificate functions to formally prove properties in stochastic systems. Supermartingales are a class of stochastic processes that decrease in expected value at every time step [66]. Their nice convergence properties and concentration bounds allow their use in designing certificate functions for stochastic dynamical systems. In particular, ranking supermartingales (RSMs) [15,44] were used to verify probability 1 reachability, and stochastic barrier functions (SBFs) [50] were used to verify safety with the specified probability p ∈ [0, 1]. Reach-avoid supermartingales (RASMs) [68] unify and extend these two concepts and were used to verify reach-avoidance properties with the specified probability p ∈ [0, 1], i.e. a conjunction of reachability and safety properties. We define and compare these concepts in Section 3.
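To make the expected-decrease condition concrete, here is a Monte-Carlo sketch on a toy 1-D stochastic system; the dynamics, the candidate V, the target-set radius, and the threshold ε are all illustrative assumptions, and the surveyed methods verify the condition formally rather than by sampling:

```python
import numpy as np

# Illustrative 1-D stochastic system: x_{t+1} = 0.5 * x_t + w_t,
# w_t ~ Uniform(-0.1, 0.1); the control policy is folded into the dynamics.
def step(x, w):
    return 0.5 * x + w

# Candidate ranking-supermartingale certificate V(x) = |x|.
def V(x):
    return np.abs(x)

# Monte-Carlo check of the expected-decrease condition
#   E_w[ V(step(x, w)) ] <= V(x) - eps   outside the target set |x| < 0.3.
rng = np.random.default_rng(1)
eps = 0.05
states = rng.uniform(-1.0, 1.0, size=200)
states = states[np.abs(states) >= 0.3]        # states outside the target set
for x in states:
    w = rng.uniform(-0.1, 0.1, size=10_000)
    est = V(step(x, w)).mean()                # estimate of E_w[V(x')]
    assert est <= V(x) - eps, (x, est)
print("expected-decrease condition holds on all sampled states")
```

Here $E_w[V(0.5x + w)] \approx 0.5|x|$, so the condition holds with margin whenever $|x| \geq 0.3$.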

Fig. 1: Schematic illustration of the learner-verifier loop.

Learner-verifier framework for stochastic dynamical systems. In Section 4, we then present the learner-verifier framework of [44,68] for learning-based control and for the verification of learned controllers in stochastic dynamical systems, which operates in a counterexample-guided inductive synthesis (CEGIS) fashion [55]. The algorithm jointly learns a neural network control policy and a neural network supermartingale certificate function. It consists of two modules: the learner, which learns a policy and a candidate supermartingale certificate function, and the verifier, which then formally verifies the candidate. If the verification step fails, the verifier computes counterexamples and passes them back to the learner, which tries to learn a new candidate. This loop is repeated until a candidate is successfully verified; see Fig. 1.
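The loop in Fig. 1 can be sketched schematically as follows; the linear "learner" and sample-based "verifier" here are toy stand-ins for the neural network training and formal verification modules of the surveyed framework:

```python
# Schematic CEGIS-style learner-verifier loop (toy instantiation):
# "learn" proposes a certificate candidate, "verify" searches for
# counterexample states violating the decrease condition.
import random

random.seed(0)

def f(x):                      # closed-loop dynamics (policy folded in)
    return 0.6 * x

def learn(data):
    # Toy learner: propose V(x) = |x| regardless of the data.
    # A real implementation trains neural networks by gradient descent.
    return lambda x: abs(x)

def verify(V, eps=0.1, n=1000):
    # Sample-based stand-in for the verifier; returns counterexamples
    # (none here, since V(f(x)) = 0.6|x| <= |x| - eps when |x| >= 0.25).
    return [x for x in (random.uniform(0.25, 1.0) for _ in range(n))
            if V(f(x)) > V(x) - eps]

data = [random.uniform(-1, 1) for _ in range(100)]
for it in range(10):                      # the learner-verifier loop
    V = learn(data)
    cexs = verify(V)
    if not cexs:
        print(f"certificate verified after {it + 1} iteration(s)")
        break
    data += cexs                          # feed counterexamples back
```

The essential structure is the same in the full framework: only the learner and verifier bodies change.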

This framework builds on the existing learner-verifier methods for learning-based control in deterministic dynamical systems [18,2,26]. However, the extension of this framework to stochastic dynamical systems and the synthesis of supermartingale certificate functions is far from straightforward. In particular, the methods of [18,2] use knowledge of the deterministic dynamics function to reduce the verification task to a decision procedure and use an off-the-shelf solver. However, verifying the expected decrease condition of supermartingale certificates by reduction to a decision procedure would require computing a closed-form expression of the expected value of a neural network function over a probability distribution and providing it to the solver. It is not clear how such a closed-form expression can be computed, and it is not known whether it exists in the general case.

This challenge is solved by using a method for efficient computation of tight upper and lower bounds on the expected value of a neural network function. The verifier module then verifies the expected decrease condition by discretizing the state space and formally verifying a slightly stricter condition at the discretization points by using the computed expected value bounds. By carefully choosing the mesh of the discretization and adding an additional error term, we obtain a sound verification method applicable to general Lipschitz continuous systems. The expected value bound computation for neural network functions relies on interval arithmetic and abstract interpretation, and since it is of independent interest, we discuss it in detail in Section 5. We are not aware of any existing methods that tackle this problem.
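A minimal sketch of the idea, under simplifying assumptions (a 1-D uniform disturbance and a tiny randomly initialized ReLU network standing in for the certificate): partition the noise support into cells, propagate each cell through the network with interval arithmetic, and sum the probability-weighted bounds:

```python
import numpy as np

# Interval-arithmetic sketch: bound E_w[ g(w) ] for a small ReLU network g
# by partitioning the noise support into cells, propagating each cell's
# interval through the network, and summing probability-weighted bounds.
rng = np.random.default_rng(2)
W1, b1 = rng.normal(size=(4, 1)), rng.normal(size=4)
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)

def interval_affine(lo, hi, W, b):
    # Exact interval image of x -> W x + b, splitting W by sign.
    Wp, Wn = np.maximum(W, 0), np.minimum(W, 0)
    return Wp @ lo + Wn @ hi + b, Wp @ hi + Wn @ lo + b

def interval_net(lo, hi):
    lo, hi = interval_affine(lo, hi, W1, b1)
    lo, hi = np.maximum(lo, 0), np.maximum(hi, 0)   # ReLU is monotone
    return interval_affine(lo, hi, W2, b2)

# w ~ Uniform(-1, 1): split the support into N equal cells of mass 1/N.
N = 200
edges = np.linspace(-1, 1, N + 1)
elo = ehi = 0.0
for a, b_ in zip(edges[:-1], edges[1:]):
    clo, chi = interval_net(np.array([a]), np.array([b_]))
    elo += clo[0] / N                 # lower bound on cell contribution
    ehi += chi[0] / N                 # upper bound on cell contribution
print(f"E[g(w)] is certified to lie in [{elo:.4f}, {ehi:.4f}]")
```

Refining the partition tightens the interval, mirroring the mesh-dependent error term of the discretization-based verifier described above.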

Extension to general stochastic certificates. We conclude this survey with a discussion of possible extensions of the learner-verifier framework in Section 6 and of related work in Section 7.

## 2 Preliminaries

We consider discrete-time stochastic dynamical systems defined via

$$\mathbf{x}_{t+1} = f(\mathbf{x}_t, \mathbf{u}_t, \omega_t), \qquad \mathbf{x}_0 \in \mathcal{X}_0.$$

The function $f : \mathcal{X} \times \mathcal{U} \times \mathcal{N} \to \mathcal{X}$ is the dynamics function of the system and $t \in \mathbb{N}_0$ is the time index. We use $\mathcal{X} \subseteq \mathbb{R}^m$ to denote the system state space, $\mathcal{U} \subseteq \mathbb{R}^n$ the control action space, and $\mathcal{N} \subseteq \mathbb{R}^p$ the stochastic disturbance space. For each $t \in \mathbb{N}_0$, we write $\mathbf{x}_t \in \mathcal{X}$ for the state of the system, $\mathbf{u}_t \in \mathcal{U}$ for the action, and $\omega_t \in \mathcal{N}$ for the stochastic disturbance vector at time $t$. The set $\mathcal{X}_0 \subseteq \mathcal{X}$ is the set of initial states. In each time step, $\mathbf{u}_t$ is chosen according to a control policy $\pi : \mathcal{X} \to \mathcal{U}$, i.e. $\mathbf{u}_t = \pi(\mathbf{x}_t)$, and $\omega_t$ is sampled according to some specified probability distribution $d$ over $\mathbb{R}^p$. The dynamics function $f$, the control policy $\pi$, and the probability distribution $d$ together define a stochastic feedback loop system.

A trajectory of the system is a sequence (x_t, u_t, ω_t)_{t∈N_0} such that, for each t ∈ N_0, we have u_t = π(x_t), ω_t ∈ support(d) and x_{t+1} = f(x_t, u_t, ω_t). For each initial state x_0 ∈ X, the system induces a Markov process. This gives rise to the probability space over the set of all trajectories of the system that start in x_0 [51]. We denote the probability measure and the expectation in this probability space by P_{x_0} and E_{x_0}, respectively.
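To make the feedback-loop model concrete, the rollout of a trajectory can be sketched as follows. The 1-D dynamics `f`, policy `pi` and uniform disturbance distribution below are hypothetical stand-ins chosen only for illustration, not systems from this survey.

```python
import numpy as np

def simulate(f, pi, sample_omega, x0, steps, rng):
    """Roll out one trajectory (x_t, u_t, w_t) of the loop x_{t+1} = f(x_t, u_t, w_t)."""
    traj = []
    x = x0
    for _ in range(steps):
        u = pi(x)               # u_t = pi(x_t)
        w = sample_omega(rng)   # w_t sampled from d
        traj.append((x, u, w))
        x = f(x, u, w)
    return traj

# Hypothetical 1-D example: linear dynamics with additive uniform noise.
f = lambda x, u, w: 0.9 * x + u + w
pi = lambda x: -0.5 * x                          # simple stabilizing feedback
sample_omega = lambda rng: rng.uniform(-0.1, 0.1)

rng = np.random.default_rng(0)
traj = simulate(f, pi, sample_omega, x0=1.0, steps=50, rng=rng)
```

Under this closed loop the state obeys x_{t+1} = 0.4·x_t + w_t, so trajectories contract toward the noise band around the origin.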

Assumptions. We assume that X ⊆ R^m, X_0 ⊆ R^m, U ⊆ R^n and N ⊆ R^p are all Borel-measurable. This is necessary for the probability space of the set of all system trajectories starting in some initial state to be mathematically well-defined. We also assume that X ⊆ R^m is compact (i.e. closed and bounded) and that the dynamics function f is Lipschitz continuous, which are common assumptions in control theory. Finally, we assume that the probability distribution d is a product of independent univariate probability distributions, which is necessary for efficient sampling and expected value computation.

#### 2.1 Brief Overview of Martingale Theory

In this subsection, we provide a brief overview of definitions and results from martingale theory that lie at the core of formal reasoning about supermartingale certificate functions. We assume that the reader is familiar with the mathematical definitions of probability space, measurability and random variables; see [66] for the necessary background. The results in this subsection will help in building an intuition about supermartingale certificate functions, but omitting them would not prevent the reader from following the rest of this paper.

Probability space. A probability space is a triple (Ω, F, P) where Ω is a state space, F is a sigma-algebra and P is a probability measure which is required to satisfy the Kolmogorov axioms [66]. A random variable is a function X : Ω → R that is F-measurable. We use E[X] to denote the expected value of X. A (discrete-time) stochastic process is a sequence (X_i)_{i=0}^∞ of random variables in (Ω, F, P).

Conditional expectation. Let X be a random variable in a probability space (Ω, F, P). Given a sub-σ-algebra F' ⊆ F, a conditional expectation of X given F' is an F'-measurable random variable Y such that, for each A ∈ F', we have

$$\mathbb{E}[X \cdot \mathbb{I}(A)] = \mathbb{E}[Y \cdot \mathbb{I}(A)].$$

Here, I(A) : Ω → {0, 1} is the indicator function of A defined via I(A)(ω) = 1 if ω ∈ A, and I(A)(ω) = 0 if ω ∉ A. Intuitively, a conditional expectation of X given F' is an F'-measurable random variable that behaves like X whenever its expected value is taken over an event in F'. A conditional expectation of a random variable X given F' is guaranteed to exist if X is real-valued and nonnegative [66]. Moreover, for any two conditional expectations Y and Y' of X given F', we have that P[Y = Y'] = 1. Therefore, the conditional expectation is almost-surely unique and we may pick one such random variable as a canonical conditional expectation and denote it by E[X | F'].

Supermartingales. Let (Ω, F, P) be a probability space and F_0 ⊆ F_1 ⊆ · · · ⊆ F be an increasing sequence of sub-σ-algebras of F with respect to inclusion. A nonnegative supermartingale with respect to (F_i)_{i=0}^∞ is a stochastic process (X_i)_{i=0}^∞ such that each X_i is F_i-measurable, and X_i(ω) ≥ 0 and E[X_{i+1} | F_i](ω) ≤ X_i(ω) hold for each ω ∈ Ω and i ≥ 0. Intuitively, the second condition says that the expected value of X_{i+1} given the value of X_i must not exceed the value of X_i. This condition is formalized by using conditional expectation.

The following two results will be key technical ingredients in our design of supermartingale certificate functions. The first theorem shows that nonnegative supermartingales have nice convergence properties and converge almost-surely to some finite value. The second theorem bounds the probability that the value of the supermartingale ever exceeds some threshold, and it will allow us to bound from above the probability of occurrence of some bad event.

Theorem 1 (Supermartingale convergence theorem [66]). Let (X_i)_{i=0}^∞ be a nonnegative supermartingale with respect to (F_i)_{i=0}^∞. Then, there exists a random variable X_∞ in (Ω, F, P) to which the supermartingale converges with probability 1, i.e. P[lim_{i→∞} X_i = X_∞] = 1.

Theorem 2 ([41]). Let (X_i)_{i=0}^∞ be a nonnegative supermartingale with respect to (F_i)_{i=0}^∞. Then, for every real λ > 0, we have P[sup_{i≥0} X_i ≥ λ] ≤ E[X_0]/λ.
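Theorem 2 can be illustrated empirically on a toy process. The multiplicative process below is a hypothetical example chosen only so that it is a nonnegative supermartingale; with a fixed seed, the empirical exceedance frequency stays below the bound E[X_0]/λ that the theorem guarantees for the true probability.

```python
import numpy as np

# Toy nonnegative supermartingale: X_0 = 1 and X_{i+1} = X_i * M_i, where the
# i.i.d. factors M_i are uniform on {0.5, 1.4}, so E[M_i] = 0.95 <= 1.
rng = np.random.default_rng(1)
lam, horizon, trials = 4.0, 100, 2000

factors = rng.choice([0.5, 1.4], size=(trials, horizon))
paths = np.cumprod(factors, axis=1)               # X_1, ..., X_horizon per trial
sup_paths = np.maximum(1.0, paths.max(axis=1))    # include X_0 = 1 in the sup
freq = np.mean(sup_paths >= lam)                  # empirical P[sup_i X_i >= lam]
bound = 1.0 / lam                                 # Theorem 2: E[X_0] / lam = 0.25
```

The bound is typically loose here: the process drifts downward in the log domain, so the true exceedance probability is well below 0.25.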

#### 2.2 Problem Statement

We now formally define the properties and control tasks that we focus on in this work. In what follows, let X_t, X_u ⊆ X be disjoint Borel-measurable sets and p ∈ [0, 1] be a lower bound on the probability with which the system under the learned controller needs to satisfy the property:


## 3 Supermartingale Certificate Functions

We now give an overview of three classes of supermartingale certificate functions that formally prove reachability, safety and reach-avoidance properties. Supermartingale certificate functions do not refer to a single class of certificate functions. Rather, we use this term to refer to all certificate functions that exhibit a supermartingale-like behavior and can formally verify properties in stochastic dynamical systems. In what follows, we assume that the control policy π is fixed. In the following section, we will then present a learner-verifier framework for jointly learning a control policy and a supermartingale certificate function.

RSMs for probability 1 reachability. We start with ranking supermartingales (RSMs), which can prove probability 1 reachability of some target set X_t. Intuitively, an RSM is a continuous function that maps system states to nonnegative real values and is required to strictly decrease in expectation by some ε > 0 in every time step until the target X_t is reached. Due to the strict expected decrease as well as the Supermartingale Convergence Theorem (Theorem 1), one can show that the existence of an RSM guarantees that the system under policy π reaches X_t with probability 1. RSMs can be viewed as a stochastic extension of Lyapunov functions. Note that RSMs can only be used to prove probability 1 reachability, but cannot be used to reason about probabilistic reachability. RSMs were originally used for proving almost-sure termination in probabilistic programs [15] and were used to certify probability 1 reachability in stochastic dynamical systems in [44].

Definition 1 (Ranking supermartingales [44]). Let X_t ⊆ X be a target set. A continuous function V : X → R is a ranking supermartingale (RSM) with respect to X_t if it satisfies:


Theorem 3 ([44]). Suppose that there exists an RSM with respect to X_t. Then, for every x_0 ∈ X_0, we have P_{x_0}[Reach(X_t)] = 1.

SBFs for probabilistic safety. On the other hand, stochastic barrier functions (SBFs) can prove probabilistic safety. Given an unsafe set X_u and probability p ∈ [0, 1), an SBF is also a continuous function mapping system states to nonnegative real values, which is required to decrease in expectation at each time step. However, unlike RSMs, the expected decrease need not be strict and there is no target set. In addition, its initial value must be at most 1, whereas its value upon reaching the unsafe set must be at least 1/(1 − p). Thus, for the system under policy π to violate the safety constraint, the value of the SBF needs to increase from at most 1 to at least 1/(1 − p) even though it is required to decrease in expectation. The probability of this event can be bounded from above and shown to be at most 1 − p by using Theorem 2. We highlight the assumption that p < 1, which is necessary for the safety constraint to be mathematically well-defined. As the name suggests, SBFs are a stochastic extension of barrier functions.

Definition 2 (Stochastic barrier functions [50]). Let X_u ⊆ X be an unsafe set and p ∈ [0, 1). A continuous function V : X → R is a stochastic barrier function (SBF) with respect to X_u and p if it satisfies:


Theorem 4 ([50]). Suppose that there exists an SBF with respect to X_u and p. Then, for every x_0 ∈ X_0, we have P_{x_0}[Safe(X_u)] ≥ p.

RASMs for probabilistic reach-avoidance. Finally, reach-avoid supermartingales (RASMs) unify and extend RSMs and SBFs in the sense that they allow simultaneous reasoning about reachability and safety and proving a conjunction of these properties, i.e. reach-avoid properties. Let X_t and X_u be disjoint target and unsafe sets and let p ∈ [0, 1). Similarly to SBFs, an RASM is a continuous nonnegative function which is required to be initially at most 1 but needs to attain a value of at least 1/(1 − p) for the unsafe region to be reached. On the other hand, similarly to RSMs, it is required to strictly decrease in expectation by ε > 0 at every time step until either the target set X_t or a state in which the value is at least 1/(1 − p) is reached. Thus, RASMs can be viewed as a stochastic extension of both Lyapunov functions and barrier functions, which combines the strict decrease of Lyapunov functions and the level-set reasoning of barrier functions.

Definition 3 (Reach-avoid supermartingales [68]). Let X_t ⊆ X and X_u ⊆ X be a target set and an unsafe set, respectively, and let p ∈ [0, 1] be a probability threshold. Suppose that either p < 1, or that p = 1 and X_u = ∅. A continuous function V : X → R is a reach-avoid supermartingale (RASM) with respect to X_t, X_u and p if it satisfies:


Theorem 5 ([68]). Suppose that there exists an RASM with respect to X_t, X_u and p. Then, for every x_0 ∈ X_0, we have P_{x_0}[ReachAvoid(X_t, X_u)] ≥ p.

Note that RASMs indeed unify and generalize the definitions of RSMs and SBFs. First, by setting X_u = ∅ and p = 1 (so 1/(1 − p) = ∞), RASMs reduce to RSMs, as the Initial condition can then be enforced without loss of generality by rescaling. Second, by setting X_t = ∅, RASMs reduce to SBFs. In this case, the Expected Decrease condition is strengthened as it requires strict decrease by ε > 0. However, the proof of Theorem 5 which we outline below also implies Theorem 4, and ε > 0 is only necessary to reason about the reachability of X_t.

We also note that RASMs strictly extend the applicability of RSMs, since RASMs can be used to prove reachability with any lower bound p ∈ [0, 1] on the probability, and not only probability 1 reachability. Indeed, if we set X_u = ∅ and p ∈ [0, 1], then in order to prove reachability of X_t with probability at least p the RASM is required to strictly decrease in expectation by ε > 0 until either X_t is reached or the RASM value exceeds 1/(1 − p) (with 1/(1 − p) = ∞ if p = 1).

In the rest of this section, we outline the proof of Theorem 5 that was presented in [68]. This proof also implies Theorem 3 and Theorem 4. We do this to highlight the connection of RSMs, SBFs and RASMs to the mathematical notion of supermartingale processes. We also do this to illustrate the tools from martingale theory that are used in proving soundness of supermartingale certificate functions, as we envision that they may be useful in designing supermartingale certificate functions for more general classes of properties.

Proof (proof sketch of Theorem 5). Here we outline the main ideas behind the proof; for the full proof we refer the reader to [68]. Let x_0 ∈ X_0. We need to show that P_{x_0}[ReachAvoid(X_t, X_u)] ≥ p. To do this, we consider the probability space (Ω_{x_0}, F_{x_0}, P_{x_0}) of trajectories that start in x_0 and for each time step t ∈ N_0 define a random variable in this probability space via

$$X\_t(\rho) = \begin{cases} V(\mathbf{x}\_t), & \text{if } \mathbf{x}\_i \notin \mathcal{X}\_t \text{ and } V(\mathbf{x}\_i) < \frac{1}{1-p} \text{ for each } 0 \le i \le t \\ 0, & \text{if } \mathbf{x}\_i \in \mathcal{X}\_t \text{ for some } 0 \le i \le t, V(\mathbf{x}\_j) < \frac{1}{1-p} \text{ for each } 0 \le j \le i \\ \frac{1}{1-p}, & \text{otherwise} \end{cases}$$

for each trajectory ρ = (x_t, u_t, ω_t)_{t∈N_0} ∈ Ω_{x_0}. Hence, (X_t)_{t=0}^∞ defines a stochastic process whose value at each time step is equal to the value of V at the current system state, unless either the target set X_t has been reached, after which future values of X_t are set to 0, or a state in which V exceeds 1/(1 − p) has been reached, after which future values of X_t are set to 1/(1 − p). It can be shown that (X_t)_{t=0}^∞ is a nonnegative supermartingale in (Ω_{x_0}, F_{x_0}, P_{x_0}). This claim can be proved by using the Nonnegativity and the Expected Decrease conditions of RASMs. Here we do not yet need that the expected decrease is strict, i.e. ε ≥ 0 in the Expected Decrease condition of RASMs would be sufficient.

Since (X_t)_{t=0}^∞ is a nonnegative supermartingale, substituting λ = 1/(1 − p) into the inequality in Theorem 2 shows that

$$\mathbb{P}\_{\mathbf{x}\_0} \left[ \sup\_{i \ge 0} X\_i \ge \frac{1}{1 - p} \right] \le (1 - p) \cdot \mathbb{E}\_{\mathbf{x}\_0} [X\_0] \le 1 - p.$$

The second inequality follows since X_0(ρ) = V(x_0) ≤ 1 for every ρ ∈ Ω_{x_0} by the Initial condition of RASMs. Hence, by the Safety condition of RASMs it follows that the system under policy π reaches the unsafe set X_u with probability at most 1 − p. Note that here we can already conclude the claim of Theorem 4.

Finally, as (X_t)_{t=0}^∞ is a nonnegative supermartingale, by Theorem 1 its value converges with probability 1. One can then prove that this limit value has to be either 0 or at least 1/(1 − p), by using the fact that the expected decrease in the Expected Decrease condition of RASMs is strict. But we showed above that a state in which V is at least 1/(1 − p) is reached with probability at most 1 − p. Hence, the probability that the system under policy π reaches the target set X_t without reaching the unsafe set X_u is at least p, i.e. P_{x_0}[ReachAvoid(X_t, X_u)] ≥ p. □

## 4 Learner-Verifier Framework for Stochastic Systems

We now present the learner-verifier framework of [44,68] for the learning-based control and verification of learned controllers in stochastic dynamical systems. We focus on the probabilistic reach-avoid problem: we assume that we are given a target set X_t, an unsafe set X_u and a probability parameter p ∈ [0, 1], and we learn a control policy π and an RASM which certifies that P_{x_0}[ReachAvoid(X_t, X_u)] ≥ p for all x_0 ∈ X_0. The algorithms for learning RSMs and SBFs can be obtained analogously, since we showed that RASMs unify and generalize RSMs and SBFs.

The algorithm behind the learner-verifier framework consists of two modules: the learner, which learns a neural network control policy π_θ and a neural network supermartingale certificate function V_ν, and the verifier, which then formally verifies the learned candidate function. If the verification step fails, the verifier produces counterexamples that are passed back to the learner to fine-tune its loss function. Here, θ and ν are vectors of neural network parameters. The loop is repeated until either a certificate function is successfully verified, or some specified timeout is reached. By incorporating feedback from the verifier, the learner is able to tune the policy and the certificate function towards ensuring that the resulting policy meets the desired reach-avoid specification.
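The loop structure can be sketched as a counterexample-guided procedure. The toy `learn` and `verify` below (a single scalar parameter and threshold checks on a 1-D grid) are illustrative stand-ins, not the neural-network instantiation described in this section.

```python
def learner_verifier(init_params, learn, verify, max_iters=100):
    """Counterexample-guided loop: 'learn' updates parameters against the
    accumulated counterexamples; 'verify' returns [] on success."""
    params, counterexamples = init_params, []
    for _ in range(max_iters):
        params = learn(params, counterexamples)
        cex = verify(params)
        if not cex:
            return params            # candidate formally verified
        counterexamples.extend(cex)  # feed counterexamples back to the learner
    return None                      # timeout

# Hypothetical toy instance: fit a slope a for V(x) = a*|x| so that V >= 2 on an
# "unsafe" region |x| >= 1 and V <= 1 on an "initial" region |x| <= 0.25.
grid = [i / 10 for i in range(-20, 21)]

def verify(a):
    cex = [("unsafe", x) for x in grid if abs(x) >= 1.0 and a * abs(x) < 2.0]
    cex += [("init", x) for x in grid if abs(x) <= 0.25 and a * abs(x) > 1.0]
    return cex

def learn(a, counterexamples):
    for kind, _ in counterexamples[-5:]:   # nudge toward the recent failures
        a += 0.25 if kind == "unsafe" else -0.25
    return a

result = learner_verifier(0.5, learn, verify)
```

After a few iterations the loop terminates with a slope for which the verifier finds no counterexamples.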

Applications. As outlined above, the learner-verifier framework can be used for learning-based control with formal guarantees that a property of interest is satisfied by jointly learning a control policy and a supermartingale certificate function for the property. On the other hand, it can also be used to formally verify a previously learned control policy by fixing policy parameters and only learning a supermartingale certificate function. Finally, if one uses a different method to learn a policy that turns out to violate the desired property, one can use the learner-verifier framework to fine-tune an unsafe policy towards repairing it and obtaining a safe policy for which a supermartingale certificate function certifies that the property of interest is satisfied.

#### 4.1 Algorithm Initialization

As mentioned in Section 1, the key challenge for the verifier is to check the Expected Decrease condition of supermartingale certificates. Our algorithm solves this challenge by discretizing the state space and verifying a slightly stricter condition at the discretization vertices, which we show to imply the Expected Decrease condition over the whole region required by Definition 3. On the other hand, learning two neural networks in parallel while simultaneously optimizing several objectives can be unstable due to inherent dependencies between the two networks. Thus, proper initialization of the networks is important. We allow all neural network architectures so long as all activation functions are continuous. Furthermore, we apply the softplus activation function to the output neuron of V_ν, in order to ensure that the value of V_ν is always nonnegative.

Discretization. A discretization X̃ of X with mesh τ > 0 is a set of states such that, for every x ∈ X, there exists a state x̃ ∈ X̃ such that ||x − x̃||_1 < τ. The algorithm takes the mesh τ as a parameter and computes a finite discretization X̃ with mesh τ by simply taking a hyper-rectangular grid of sufficiently small cell size. Since X is compact, this yields a finite discretization.
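For a box-shaped state space, such a grid can be sketched as follows; the spacing is a conservative choice (an assumption of this sketch) that guarantees the nearest vertex is within τ in the L1 norm.

```python
import numpy as np

def discretize(lows, highs, tau):
    """Vertices of a hyper-rectangular grid over the box [lows, highs] forming a
    discretization with L1 mesh < tau: with spacing s, the nearest vertex is
    within s/2 per dimension, hence within m*s/2 = m*tau/(m+1) < tau in L1."""
    m = len(lows)
    s = 2.0 * tau / (m + 1)
    axes = [np.arange(lo, hi + s, s) for lo, hi in zip(lows, highs)]
    mesh = np.meshgrid(*axes, indexing="ij")
    return np.stack([g.ravel() for g in mesh], axis=1)

grid = discretize(lows=[-1.0, -1.0], highs=[1.0, 1.0], tau=0.3)
```

The number of vertices grows exponentially with the dimension m, which is why the Lipschitz regularization discussed later, allowing a coarser mesh, matters in practice.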

Network initialization. The policy network π_θ is initialized by running proximal policy optimization (PPO) [54] on the Markov decision process (MDP) defined by the stochastic dynamical system with the reward function r_t = 1[X_t](x_t) − 1[X_u](x_t).

The discretization X̃ is used to define three sets of states which are then used by the learner to initialize the certificate network V_ν, and to which counterexamples computed by the verifier will be added later. In particular, the algorithm initializes C_init = X̃ ∩ X_0, C_unsafe = X̃ ∩ X_u and C_decrease = X̃ ∩ (X \ X_t).

#### 4.2 The Learner Module

The learner updates the parameters θ of the policy and ν of the neural network certificate function candidate V_ν with the objective of the candidate satisfying the supermartingale certificate conditions. The parameter updates happen incrementally via gradient descent of the form θ ← θ − α·∂L(θ, ν)/∂θ and ν ← ν − α·∂L(θ, ν)/∂ν, where α > 0 is the learning rate and L is a loss function that corresponds to a differentiable optimization objective for the supermartingale certificate conditions. Ideally, the global minimum of L should correspond to a policy π_θ and a neural network V_ν that fulfill all certificate conditions. In practice, however, due to the non-convexity of the network V_ν, gradient descent is not guaranteed to converge to the global minimum. As a result, the learner is not monotone, i.e. a new iteration does not guarantee an improvement over the previous iteration. The training process usually applies a fixed number of gradient descent iterations or, alternatively, continues until a certain threshold on the loss value is reached.
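The alternating gradient updates can be sketched as follows. The quadratic toy loss and the finite-difference gradients below are illustrative stand-ins for the actual certificate loss and an autodiff framework.

```python
import numpy as np

def grad(loss, params, eps=1e-6):
    """Central finite-difference gradient; a stand-in for autodiff."""
    g = np.zeros_like(params)
    for i in range(params.size):
        e = np.zeros_like(params)
        e[i] = eps
        g[i] = (loss(params + e) - loss(params - e)) / (2.0 * eps)
    return g

# Hypothetical smooth joint loss over policy parameters theta and certificate
# parameters nu; a real learner would plug in the certificate loss L(theta, nu).
def L(theta, nu):
    return np.sum((theta - 1.0) ** 2) + np.sum((nu + 2.0) ** 2)

theta, nu, alpha = np.zeros(3), np.zeros(3), 0.1
for _ in range(200):
    theta = theta - alpha * grad(lambda t: L(t, nu), theta)  # theta-step
    nu = nu - alpha * grad(lambda n: L(theta, n), nu)        # nu-step
```

On this convex toy loss the iterates converge to the unique minimizer; for the actual non-convex networks only a local minimum can be expected, as noted above.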

Loss functions. The particular type of loss function L depends on the type of supermartingale certificate function that should be learned by the network, but is of the general form

$$\mathcal{L}(\theta, \nu) = \mathcal{L}\_{\text{Certificate}}(\theta, \nu) + \lambda \cdot \left(\mathcal{L}\_{\text{Lipschitz}}(\theta) + \mathcal{L}\_{\text{Lipschitz}}(\nu)\right), \tag{1}$$

where L_Certificate is the specification-specific loss. The auxiliary loss terms L_Lipschitz regularize the training to obtain networks π_θ and V_ν that have a low upper bound on their Lipschitz constants. The purpose of this regularization is that networks with a low Lipschitz upper bound are easier to check by the verifier module, i.e. they allow a coarser discretization grid. The value of λ > 0 determines the strength of the regularization that is applied. The regularization loss is based on the upper bound derived in [57] and is defined as

$$\mathcal{L}\_{\text{Lipschitz}}(\nu) = \max \left\{ L\_{V\_{\nu}} - \frac{\delta}{\tau \cdot \left( L\_f \cdot \left( L\_\pi + 1 \right) + 1 \right)}, 0 \right\}, \tag{2}$$

with L_Lipschitz(θ) defined analogously in terms of the Lipschitz bound of the policy network π_θ.

In the case of a reach-avoid specification, the RASM certificate loss is

$$\mathcal{L}\_{\text{Certificate}}(\theta,\nu) = \mathcal{L}\_{\text{Expected}}(\theta,\nu) + \mathcal{L}\_{\text{Unsafe}}(\nu) + \mathcal{L}\_{\text{Init}}(\nu),\tag{3}$$

with

$$\mathcal{L}\_{\text{Expected}}(\theta,\nu) = \frac{1}{|C\_{\text{decrease}}|} \cdot \sum\_{\mathbf{x} \in C\_{\text{decrease}}} \max \left\{ \sum\_{i=1}^{N} \frac{V\_{\nu} \left( f(\mathbf{x}, \pi\_{\theta}(\mathbf{x}), \omega\_{i}) \right)}{N} - V\_{\nu}(\mathbf{x}) + \tau \cdot K, 0 \right\},$$

where ω_1, . . . , ω_N ∼ d are independently sampled disturbance vectors,

$$\mathcal{L}\_{\text{Init}}(\nu) = \max\_{\mathbf{x} \in C\_{\text{init}}} \left\{V\_{\nu}(\mathbf{x}) - 1, 0\right\},$$

$$\mathcal{L}\_{\text{Unsafe}}(\nu) = \max\_{\mathbf{x} \in C\_{\text{unsafe}}} \left\{\frac{1}{1 - p} - V\_{\nu}(\mathbf{x}), 0\right\}.$$

The sets C_decrease, C_init and C_unsafe are the training sets for achieving the expected decrease, initial and unsafe RASM conditions, respectively. Each of the three sets is initialized with a coarse discretization of the state space to guide the learning toward a correct RASM already in the first loop iteration. In subsequent calls to the learner, these sets are extended by counterexamples computed by the verifier. In [68] it was shown that, if V_ν is an RASM and satisfies all conditions checked by the verifier below, then L_Certificate(θ, ν) → 0 as the number of samples N used to estimate expected values in L_Expected(θ, ν) increases.
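The Monte Carlo estimation inside the expected-decrease loss can be sketched as follows; the 1-D dynamics, policy, candidate V and the values of τ, K and the uniform disturbance distribution below are hypothetical choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical ingredients: 1-D dynamics, policy and candidate certificate.
f = lambda x, u, w: 0.8 * x + u + w
pi = lambda x: -0.2 * x
V = lambda x: x ** 2 + 0.1
tau, K, N = 0.01, 1.0, 500

def expected_decrease_loss(states):
    """Monte Carlo estimate of L_Expected over a batch of training states:
    mean_x max{ (1/N) * sum_i V(f(x, pi(x), w_i)) - V(x) + tau*K, 0 }."""
    total = 0.0
    for x in states:
        w = rng.uniform(-0.05, 0.05, size=N)    # w_i sampled from d
        est = np.mean(V(f(x, pi(x), w)))        # empirical expectation
        total += max(est - V(x) + tau * K, 0.0)
    return total / len(states)

loss = expected_decrease_loss(np.linspace(-1.0, 1.0, 21))
```

The max truncates at 0 for states where the candidate already decreases in expectation by at least τ·K, so only the failing states contribute to the loss.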

#### 4.3 The Verifier Module

Verification task. The verifier now formally checks whether the learned RASM candidate V_ν satisfies the four RASM defining conditions in Definition 3. Since we applied the softplus activation function to the output neuron of V_ν, we know that the Nonnegativity condition is satisfied by default. Thus, the verifier only needs to check the Initial, Safety and Expected Decrease conditions in Definition 3.

Expected Decrease condition. To check the Expected Decrease condition, we utilize the fact that the dynamics function f is Lipschitz continuous and that the state space X is compact to show that it suffices to check a slightly stricter condition at the discretization points. Let L_f be a Lipschitz constant of f. Since π_θ and V_ν are continuous functions defined over the compact domain X, we know that they are also Lipschitz continuous. Let L_π and L_V be their Lipschitz constants. We assume that L_f is provided to the algorithm, and use the method of [57] for computing neural network Lipschitz constants to compute L_π and L_V.

To verify the Expected Decrease condition, the verifier collects the subset X̃_e ⊆ X̃ of all discretization vertices whose adjacent grid cells contain a non-target state and over which V_ν attains a value that is smaller than 1/(1 − p). To compute this set, the algorithm first collects all grid cells that intersect X \ X_t. For each collected cell, it then uses interval arithmetic abstract interpretation (IA-AI) [24,30] to propagate interval bounds across neural network layers towards bounding from below the minimal value that V_ν attains over the cell. Finally, it adds to X̃_e the vertices of those cells at which the computed lower bound is less than 1/(1 − p).
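The IA-AI bound propagation can be sketched for a small feed-forward ReLU network; the two-layer weights below are hypothetical, and the cell is a 2-D box.

```python
import numpy as np

def affine_bounds(lo, hi, W, b):
    """Interval propagation through an affine layer y = W x + b: splitting W
    into its positive and negative parts gives the tight per-interval bounds."""
    Wp, Wn = np.maximum(W, 0.0), np.minimum(W, 0.0)
    return Wp @ lo + Wn @ hi + b, Wp @ hi + Wn @ lo + b

def network_bounds(lo, hi, layers):
    """Sound output bounds of a feed-forward ReLU network over the box [lo, hi]."""
    for W, b in layers[:-1]:
        lo, hi = affine_bounds(lo, hi, W, b)
        lo, hi = np.maximum(lo, 0.0), np.maximum(hi, 0.0)  # ReLU is monotone
    W, b = layers[-1]
    return affine_bounds(lo, hi, W, b)

# Tiny hypothetical certificate network V : R^2 -> R.
layers = [(np.array([[1.0, -1.0], [0.5, 2.0]]), np.array([0.1, -0.2])),
          (np.array([[1.0, 1.0]]), np.array([0.05]))]

# Bounds on V over the grid cell [0, 0.1] x [0, 0.1].
lo, hi = network_bounds(np.array([0.0, 0.0]), np.array([0.1, 0.1]), layers)
```

The returned lower bound plays the role of the cell-wise lower bound on V_ν used to decide whether a cell's vertices belong to X̃_e.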

Finally, the verifier checks whether the following condition is satisfied at each x̃ ∈ X̃_e:

$$\mathbb{E}\_{\omega \sim d} \left[ V\_{\nu} \left( f(\tilde{\mathbf{x}}, \pi\_{\theta}(\tilde{\mathbf{x}}), \omega) \right) \right] < V\_{\nu}(\tilde{\mathbf{x}}) - \tau \cdot K,\tag{4}$$

where K = L_V · (L_f · (L_π + 1) + 1). Note that this condition is a strengthened version of the Expected Decrease condition, where instead of strict decrease by an arbitrary ε > 0 we require strict decrease by at least τ · K, which depends on the discretization mesh τ and the Lipschitz constants of f, π_θ and V_ν. To compute E_{ω∼d}[V_ν(f(x̃, π_θ(x̃), ω))] in eq. (4), we cannot simply evaluate the expected value in state x̃ by substituting x̃ into some expression, as we do not know a closed-form expression for the expected value of a neural network function. Instead, the algorithm uses the method of [44] to compute upper and lower bounds on the expected value of a neural network function, which we describe in Section 5. The upper bound is then plugged into eq. (4).

If no violations of eq. (4) are found, the verifier concludes that the Expected Decrease condition is satisfied. Otherwise, for any counterexample x̃ to eq. (4), the algorithm checks if x̃ ∈ X \ X_t and V_ν(x̃) < 1/(1 − p) and, if so, adds it to the counterexample set C_decrease.

Initial and Safety conditions. The Initial and Safety conditions are checked using IA-AI. To check the Initial condition, the verifier collects the set Cells_{X_0} of all grid cells that intersect the initial set X_0, and for each cell in Cells_{X_0} checks if

$$\sup\_{\mathbf{x}\in\text{cell}} V\_{\nu}(\mathbf{x}) > 1. \tag{5}$$

The supremum is bounded from above via IA-AI by propagating interval bounds across neural network layers. If no violations are found, the verifier concludes that V_ν satisfies the Initial condition. Otherwise, the vertices of any grid cells which are counterexamples to eq. (5) and which are contained in X_0 are added to C_init. Analogously, to check the Safety condition, the verifier collects the set Cells_{X_u} of all grid cells that intersect the unsafe set X_u, and for each cell checks if

$$\inf\_{\mathbf{x}\in\text{cell}}V\_{\nu}(\mathbf{x}) < \frac{1}{1-p}.\tag{6}$$

If no violations are found, the verifier concludes that V_ν satisfies the Safety condition. Otherwise, the vertices of any grid cells which are counterexamples to eq. (6) and which are contained in X_u are added to C_unsafe.

Algorithm output and correctness. If all three checks are successful and no counterexample is found, the algorithm concludes that π_θ guarantees reach-avoidance with probability at least p and outputs the policy π_θ. Otherwise, it proceeds to the next learner-verifier iteration, where the computed counterexamples are added to the sets C_init, C_unsafe and C_decrease to be used by the learner. The following theorem establishes correctness of the verifier module; its proof can be found in [68].

Theorem 6 ([68]). Suppose that the verifier verifies that the certificate V_ν satisfies eq. (4) for each x̃ ∈ X̃_e, eq. (5) for each cell ∈ Cells_{X_0} and eq. (6) for each cell ∈ Cells_{X_u}. Then the function V_ν is an RASM for the system with respect to X_t, X_u and p.

Optimizations. The verification task can be made more efficient by a discretization refinement procedure. In particular, the verifier may start with a coarse grid and decompose each grid cell on demand into a finer discretization when the check of some RASM condition fails. This procedure can be applied recursively to refine further in case elements of the decomposed grid cannot be verified. If the recursion encounters a grid element that violates eq. (4) even for τ = 0, the refinement procedure terminates unsuccessfully with the grid center point as a counterexample to the RASM condition. This optimization with a maximum recursion depth of 1 was applied in [68].
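The on-demand refinement can be sketched generically; the 1-D `check` and binary `split` below are hypothetical placeholders for the RASM condition checks and the grid-cell decomposition.

```python
def verify_with_refinement(cell, check, split, max_depth=1):
    """On-demand refinement: if 'check' fails on a cell, split it and retry on
    the pieces up to 'max_depth'; returns the list of cells that still fail."""
    if check(cell):
        return []
    if max_depth == 0:
        return [cell]
    failures = []
    for sub in split(cell):
        failures += verify_with_refinement(sub, check, split, max_depth - 1)
    return failures

# Hypothetical 1-D instance: a toy condition that holds either on narrow enough
# cells or on cells lying entirely in the right half of the domain.
check = lambda c: (c[1] - c[0]) * 10.0 <= 1.0 or c[0] >= 0.5
split = lambda c: [(c[0], (c[0] + c[1]) / 2), ((c[0] + c[1]) / 2, c[1])]

remaining = verify_with_refinement((0.0, 1.0), check, split, max_depth=2)
```

Cells surviving all refinement levels are reported back as failures, which in the actual algorithm become counterexamples for the learner.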

## 5 Bounding Expected Values of Neural Networks

We now present the method for computing upper and lower bounds on the expected value of a neural network function over a given probability distribution. We are not aware of any existing methods for solving this problem, so we believe that this is a result of independent interest.

To define the setting of the problem at hand, let x ∈ X ⊆ R^n be a system state and suppose that we want to compute upper and lower bounds on the expected value E_{ω∼d}[V(f(x, π(x), ω))]. Here, d is the probability distribution over the stochastic disturbance space N ⊆ R^p from which the stochastic disturbance is sampled independently at each time step. As noted in Section 2, we assume that d is a product of independent univariate probability distributions. Alternatively, the method is also applicable if the support of d is bounded.

The method first partitions the stochastic disturbance space N ⊆ R^p into finitely many cells cell(N) = {N_1, . . . , N_k}. Let maxvol = max_{N_i∈cell(N)} vol(N_i) and minvol = min_{N_i∈cell(N)} vol(N_i) denote the maximal and the minimal volume of any cell in the partition with respect to the Lebesgue measure over R^p, respectively. Also, for each ω ∈ N, let F(ω) = V(f(x, π(x), ω)). The upper and the lower bound on the expected value are computed as follows:

$$\mathbb{E}\_{\omega \sim d} \left[ V \left( f(\mathbf{x}, \pi(\mathbf{x}), \omega) \right) \right] \le \sum\_{\mathcal{N}\_i \in \text{cell}(\mathcal{N})} \max \text{vol} \cdot \sup\_{\omega \in \mathcal{N}\_i} F(\omega),$$

$$\mathbb{E}\_{\omega \sim d} \left[ V \left( f(\mathbf{x}, \pi(\mathbf{x}), \omega) \right) \right] \ge \sum\_{\mathcal{N}\_i \in \text{cell}(\mathcal{N})} \min \text{vol} \cdot \inf\_{\omega \in \mathcal{N}\_i} F(\omega).$$

Each supremum (resp. infimum) in the sum is then bounded from above (resp. from below) via interval arithmetic abstract interpretation by using the method of [30].

If the support of d is bounded, then no further adjustments are needed. However, if the support of d is unbounded, maxvol and minvol may not be finite. In this case, since we assume that d is a product of univariate distributions, the method first applies the probability integral transform [48] to each univariate probability distribution in d in order to reduce the problem to the case of a probability distribution of bounded support.
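For the bounded-support case (e.g. after the probability integral transform, when d becomes uniform on [0, 1] so that a cell's probability equals its volume), the cell-wise bounds can be sketched in one dimension. The monotone toy function F below stands in for ω ↦ V(f(x, π(x), ω)); because it is monotone, its per-cell sup/inf sit at the cell endpoints, whereas the surveyed method would bound them via interval arithmetic abstract interpretation.

```python
import numpy as np

# Equal-volume partition of the (transformed) disturbance space [0, 1].
F = lambda w: 1.0 / (1.0 + w)     # decreasing, so sup/inf sit at cell endpoints
k = 200
edges = np.linspace(0.0, 1.0, k + 1)
vol = 1.0 / k                     # here maxvol = minvol = 1/k

upper = sum(vol * F(a) for a in edges[:-1])   # sum of maxvol * sup_{cell} F
lower = sum(vol * F(b) for b in edges[1:])    # sum of minvol * inf_{cell} F
```

The true expected value here is the integral of 1/(1+w) over [0, 1], i.e. ln 2, which is sandwiched between the two sums, and the gap shrinks as the partition is refined.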

## 6 Discussion on Extension to General Certificates

The focus of this survey has primarily been on three concrete classes of supermartingale certificate functions in stochastic systems, namely RSMs, SBFs and RASMs, and on the learner-verifier framework for their computation. For each class of supermartingale certificate functions, the learner module encodes the defining conditions of the certificate as a differentiable loss function whose minimization leads to a candidate certificate function. The verifier module then formally checks whether the defining conditions of the certificate function are satisfied. These checks are performed by discretizing the state space and using interval arithmetic abstract interpretation and the previously discussed method for computing bounds on expected values of neural network functions.

It should be noted that the design of both the learner and the verifier modules was not specifically tailored to any of the three certificate functions. Rather, both the learner and the verifier follow very general design principles that we envision are applicable to more general classes of certificate functions. In particular, we hypothesize that as long as the state space of the system is compact and a certificate function can be defined in terms of


then the learner-verifier framework in Section 4 may present a promising approach to learning and verifying the certificate function. In particular, the learner-verifier framework presents a natural candidate for automating the computation of any supermartingale certificate function that may be designed for other properties in the future. Furthermore, while RSMs, SBFs and RASMs exhibit a supermartingale-like behavior which is fundamental for their soundness, the learner-verifier framework does not rely on this supermartingale-like behavior. Hence, we envision that the learner-verifier framework could also be used to compute other classes of stochastic certificate functions.

Even more generally, note that all certificate functions that we have considered so far are of type X → R. One could also consider extensions of the learner-verifier framework to learning certificate functions of different types. For instance, the work [43] uses a learner-verifier framework to learn an inductive transition invariant of type X × X → R that certifies safety in deterministic systems. On the other hand, lexicographic ranking supermartingales are a multidimensional generalization of RSMs of type X → R^k that provide a more efficient and compositional approach to proving probability 1 termination in probabilistic programs [5,22]. Studying possible extensions of the learner-verifier framework for stochastic systems to learn certificate functions with different domains and codomains is a very interesting direction of future work.

## 7 Related Work

Existing learning-based methods for learning and verification of certificate functions in deterministic and stochastic systems have been discussed in Section 1. In this section, we overview some other existing methods for verification and control of stochastic dynamical systems, as well as some other uses of martingale theory in stochastic system verification.

Abstraction-based methods. Another class of approaches to stochastic dynamical system control with formal safety guarantees are abstraction-based methods [56,42,14,63,60,25]. These methods consider finite-time horizon systems and approximate them via a finite-state Markov decision process (MDP). The control problem is then solved for the obtained MDP, and the computed policy is used to exhibit a policy for the original stochastic dynamical system. The key difference in applicability between abstraction-based methods and our framework is that abstraction-based methods consider finite-time horizon systems, whereas we consider infinite-time horizon systems.
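For intuition, once the abstraction yields a finite MDP, the finite-horizon control problem can be solved by backward value iteration; the sketch below maximizes the probability of staying safe for a given horizon (the data layout is illustrative, not taken from any of the cited tools).

```python
# Sketch: maximum probability of avoiding `unsafe` for `horizon` steps in a
# finite MDP. P[s][a] is a list of (probability, successor) pairs; unsafe
# states have value 0. This is the standard backward recursion applied after
# an abstraction-based method has produced the finite MDP.

def max_safety_prob(P, unsafe, horizon):
    v = {s: 0.0 if s in unsafe else 1.0 for s in P}
    for _ in range(horizon):
        v = {s: 0.0 if s in unsafe else
                max(sum(p * v[t] for p, t in P[s][a]) for a in P[s])
             for s in P}
    return v

# toy abstraction: from 's', action 'stay' remains safe with probability 0.9
P = {'s': {'stay': [(0.9, 's'), (0.1, 'u')]},
     'u': {'stay': [(1.0, 'u')]}}
v = max_safety_prob(P, {'u'}, 2)       # v['s'] == 0.9 * 0.9 == 0.81
```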

Safe control via shielding. Shielding is an RL framework that ensures safety in the context of avoidance of unsafe regions by computing two control policies – the main policy that optimizes the expected reward, and the backup policy that the system falls back to whenever the safety constraint may be violated [7,36,29].

Constrained MDPs. A standard approach to safe RL is to solve constrained MDPs (CMDPs) [8,28] which impose hard constraints on expected cost for one or more auxiliary cost functions. Several efficient RL algorithms for solving CMDPs have been proposed [59,4], however their constraints are only satisfied in expectation, hence constraint satisfaction is not formally guaranteed.
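The fallback mechanism of a shield can be sketched in a few lines (the helper names below, in particular `is_safe_after`, are hypothetical stand-ins for whatever safety check a concrete shield uses):

```python
# Sketch of a shield: run the main (reward-optimizing) policy, but fall back
# to the backup policy whenever the proposed action could violate safety.

def shielded_action(state, main_policy, backup_policy, is_safe_after):
    action = main_policy(state)
    if is_safe_after(state, action):
        return action
    return backup_policy(state)       # safety override

# toy instance: a 1-D position that must stay in [-1, 1]
act = shielded_action(
    0.9,
    main_policy=lambda s: 0.3,                   # would move to 1.2: unsafe
    backup_policy=lambda s: -0.1,
    is_safe_after=lambda s, a: abs(s + a) <= 1.0,
)
# act == -0.1 (the backup action)
```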

RL reward specification and neurosymbolic methods. There are several works on solving model-free RL tasks under logic specifications. In particular, several works propose methods for designing reward functions that encode temporal logic specifications [6,12,32,31,45,34,13,40,39]. Formal methods have also been used for extraction of interpretable policies [62,61,35] and safe RL [10,67,11].

Deterministic systems with stochastic controllers. Another way to give rise to a stochastic dynamical system is to consider a dynamical system with deterministic dynamics function and use a stochastic controller, which helps in quantifying uncertainty in the controller's prediction. Formal verification of deterministic dynamical systems with Bayesian neural network controllers has been considered in [43]. In particular, this work also uses a learner-verifier method to learn an inductive invariant for the deterministic system which formally proves safety.

Supermartingales for probabilistic program analysis. Supermartingales have also been used for the analysis of probabilistic programs (PPs). In particular, RSMs were originally introduced in the setting of PPs to prove almost-sure termination [15] and have since been extensively used, see e.g. [19,20,5,47,22]. The work [1] proposed a learner-verifier method to learn an RSM in the PP. Supermartingales were also used for safety [23,64,21], cost [65] and recurrence and persistence [16] analysis in PPs.

## 8 Conclusion

This paper presents a framework for learning-based control with formal reachability, safety and reach-avoidance guarantees in stochastic dynamical systems. We present a learner-verifier framework in which a neural network control policy is learned together with a neural network certificate function that formally proves that the property of interest holds with at least some desired probability p ∈ [0, 1]. For certification, we use supermartingale certificate functions. The learner module encodes the defining certificate function conditions into a differentiable loss function which is then minimized to learn a candidate certificate function. The verifier then formally verifies the candidate by using interval arithmetic abstract interpretation and a novel method for computing bounds on expected values of neural networks.

The learner-verifier framework presented in this work opens several interesting directions for future work. The first is the design of supermartingale certificates for more general properties of stochastic systems and the use of our learner-verifier framework for their computation. The second is to study and understand the general class of certificate functions in stochastic systems that the learner-verifier can be used to compute, possibly going beyond supermartingale certificate functions. Finally, on the practical side, an avenue for future work is to explore methods for reducing the computational cost of the framework and extensions that can handle more complex and higher-dimensional systems.

## References


342. ACM (2016). https://doi.org/10.1145/2837614.2837639


L. (eds.) Tools and Algorithms for the Construction and Analysis of Systems - 25th International Conference, TACAS 2019, Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2019, Prague, Czech Republic, April 6-11, 2019, Proceedings, Part I. Lecture Notes in Computer Science, vol. 11427, pp. 395–412. Springer (2019). https://doi.org/10.1007/978-3-030-17462-0_27


https://doi.org/10.1016/j.automatica.2013.10.013


ACM (2019). https://doi.org/10.1145/3302504.3311809


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Model Checking**

## Bounded Model Checking for Asynchronous Hyperproperties<sup>⋆</sup>

Tzu-Han Hsu<sup>1</sup>, Borzoo Bonakdarpour<sup>1</sup>, Bernd Finkbeiner<sup>2</sup>, and César Sánchez<sup>3</sup>

<sup>1</sup> Michigan State University, East Lansing, MI, USA {tzuhan,borzoo}@msu.edu <sup>2</sup> CISPA Helmholtz Center, Saarbrücken, Germany finkbeiner@cispa.de <sup>3</sup> IMDEA Software Institute, Madrid, Spain cesar.sanchez@imdea.org

Abstract. Many types of attacks on confidentiality stem from the nondeterministic nature of the environment that computer programs operate in. We focus on verification of confidentiality in nondeterministic environments by reasoning about asynchronous hyperproperties. We generalize the temporal logic A-HLTL to allow nested trajectory quantification, where a trajectory determines how different execution traces may advance and stutter. We propose a bounded model checking algorithm for A-HLTL based on QBF-solving for a fragment of A-HLTL and evaluate it by various case studies on concurrent programs, scheduling attacks, compiler optimization, speculative execution, and cache timing attacks. We also rigorously analyze the complexity of model checking A-HLTL.

## 1 Introduction

Motivation. Consider the concurrent program [10] shown in Fig. 1, where h is a secret variable, and the await command is a conditional critical region. This program should satisfy the following information-flow policy: "Any sequence of observable outputs produced by an interleaving should be reproducible by some other interleaving for a different value of h". If this is the case, then an attacker cannot successfully guess the value of h from the sequence of observable outputs of the print() statements. For example, Fig. 2 shows how one can align two interleavings of threads T1 and T2 with respect to the observable sequence of outputs 'abcd', given two different values of secret h. Let us call such an alignment a trajectory (illustrated by the sequence of dashed lines). However, if

```
Thread T1() {
  await sem > 0 then
    sem = sem - 1;
  print('a');
  v = v + 1;
  print('b');
  sem = sem + 1;
}

Thread T2() {
  print('c');
  if h then
    await sem > 0 then
      sem = sem - 1;
    v = v + 2;
    sem = sem + 1;
  else
    skip;
  print('d');
}
```

Fig. 1: T1 and T2 leak the value of h.

© The Author(s) 2023

<sup>⋆</sup> This research has been partially supported by the United States NSF SaTC Award 2100989, by the Madrid Regional Gov. Project BLOQUES-CM (S2018/TCS-4339), by Project PRODIGY (TED2021-132464B-I00) funded by MCIN/AEI/10.13039/501100011033/ and the EU NextGenerationEU/PRTR, by the German Research Foundation (DFG) as part of TRR 248 (389792660), and by the European Research Council (ERC) Grant HYPER (101055412)

S. Sankaranarayanan and N. Sharygina (Eds.): TACAS 2023, LNCS 13993, pp. 29–46, 2023. https://doi.org/10.1007/978-3-031-30823-9_2

thread T1 holds the semaphore and executes the critical region as an atomic operation, then the output 'acdb', arising due to concurrent execution of threads T1 and T2, reveals the value of h as 0, as the same output cannot be reproduced when h=1. Thus, the program in Fig. 1 violates the above policy.

The above policy is an example of a hyperproperty [5]; i.e., a set of sets of execution traces. In addition to information-flow requirements, hyperproperties can express other complex requirements such as linearizability [12] and control conditions in cyber-physical systems such as robustness and sensitivity. The temporal logic A-HLTL [1] can express hyperproperties whose sets of traces advance at different speeds, allowing stuttering steps. For example, the above policy can be expressed in A-HLTL by the following formula: φNI = ∀π.∃π′.Eτ. (h_{π,τ} ≠ h_{π′,τ}) ∧ □(obs_{π,τ} = obs_{π′,τ}), where obs denotes the output observations, meaning that for all executions (i.e., interleavings) π, there should exist another execution π′ and a trajectory τ, such that π and π′ start from different values of h and τ can align all the observations along π and π′ (see Fig. 2). A-HLTL can reason about one source of nondeterminism by the scheduler in the system that may lead to information leak. Indeed, the model checking algorithms proposed in [1] can discover the bug in the program in Fig. 1.

```
Thread T1() {
  while (true) {
    await sem > 0 then
      sem = sem - 1;
    print('a');
    v = v + 1;
    print('b');
    sem = sem + 1;
  }
}

Thread T2() {
  while (true)
    h = read(Channel1);
}

Thread T3() {
  while (true) {
    print('c');
    if (h == l) then
      await sem > 0 then
        sem = sem - 1;
      v = v + 2;
      sem = sem + 1;
    else
      skip;
    print('d');
  }
}

Thread T4() {
  while (true)
    l = read(Channel2);
}
```

Fig. 3: T1 and T2 receive inputs from asynch. channels read by T3 and T4.

Now, consider a more complex version of the same program, shown in Fig. 3, inspired by modern programming languages such as Go and P that allow CSP-style concurrency. Here, new threads T3 and T4 read the values of secret input h and public input l from two asynchronous channels, rendering two different sources of nondeterminism: (1) the scheduler that results in different interleavings, and (2) data availability in the channels. This, in turn, means formula φNI no longer captures the intended specification of the program, which should be:

"Any sequence of observable outputs produced by an interleaving should be reproducible by some other interleaving such that for all alignments of public inputs, there exists an alignment of the public outputs".

Satisfaction of this policy (not expressible in A-HLTL as proposed in [1]) prohibits an attacker from successfully determining the sequence of values of h.

Fig. 2: Two secure interleavings for the program in Fig. 1

Contributions. In this paper, we strive for a general logic-based approach that enables model checking of a rich set of asynchronous hyperproperties. To this end, we concentrate on A-HLTL model checking for programs subject to multiple sources of nondeterminism. Our first contribution is a generalization of A-HLTL that allows nested trajectory quantification. For example, the above policy requires reasoning about two different trajectories that cannot be composed into one since their sources of nondeterminism are different. This observation motivates the need for enriching A-HLTL with the tools to quantify over trajectories. This generalization enables expressing policies such as the following:

$$
\varphi_{\mathsf{NI}_{\mathsf{nd}}} = \forall \pi. \exists \pi'. \mathsf{A}\tau. \mathsf{E}\tau'. \big(\square(\mathsf{h}_{\pi,\tau} \neq \mathsf{h}_{\pi',\tau}) \land \square(\mathsf{l}_{\pi,\tau} = \mathsf{l}_{\pi',\tau})\big) \rightarrow \square(\mathsf{obs}_{\pi,\tau'} = \mathsf{obs}_{\pi',\tau'}),
$$

where A and E denote the universal and existential trajectory quantifiers, respectively.

Our second contribution is a bounded model checking (BMC) algorithm for a fragment of the extended A-HLTL that allows an arbitrary number of trace quantifier alternations and up to one trajectory quantifier alternation. Following [15], we propose two bounded semantics (called optimistic and pessimistic) for A-HLTL based on the satisfaction of eventualities. We introduce a reduction to the satisfiability problem for quantified Boolean formulas (QBF) and prove that our translation provides decision procedures for A-HLTL BMC for terminating systems, i.e., those whose Kripke structure is acyclic. Our focus on terminating programs is due to the general undecidability of A-HLTL model checking [1]. As in the classic BMC for LTL, the power of our technique is in hunting bugs that are often in the shallow parts of reachable states.

Our third contribution is rigorous complexity analysis of A-HLTL model checking for terminating programs (see Table 1). We show that for formulas with only one trajectory quantifier the complexity is aligned with that of classic synchronous semantics of HyperLTL [4]. However, the complexity of A-HLTL model checking with multiple trajectory quantifiers is one step higher than HyperLTL model checking in the polynomial hierarchy. An interesting observation here is that the complexity of model checking a formula with two existential trajectory quantifiers is one step higher than one with only one existential quantifier


Table 1: A-HLTL model checking complexity for acyclic models.

although the plurality of the quantifiers does not change. Generally speaking, A-HLTL model checking for terminating programs remains in PSPACE.

Finally, we have implemented our BMC technique. We evaluate our implementation on verification of four case studies: (1) information-flow security in concurrent programs, (2) information leak in speculative executions, (3) preservation of security in compiler optimization, and (4) cache-based timing attacks. These case studies exhibit a proof of concept for the highly intricate nature of information-flow requirements and how our foundational theoretical results handle them.

Related Work. The concept of hyperproperties is due to Clarkson and Schneider [5]. HyperLTL [4] and A-HLTL are currently the only logics for which practical model checking algorithms are known [8,7,15,1]. For HyperLTL, the algorithms have been implemented in the model checker MCHyper and the bounded model checker HyperQB [14]. HyperLTL is limited to synchronous hyperproperties. The A-HLTL model checking problem is known to be undecidable in general [1]. However, decidable fragments that can express observational determinism, noninterference, and linearizability have been identified. This paper generalizes A-HLTL by allowing nested trajectory quantifiers and, due to the general undecidability result, focuses on terminating programs.

FOL[E] [6] can express a limited form of asynchronous hyperproperties. As shown in [6], FOL[E] is subsumed by HyperLTL with additional quantification over predicates. For S1S[E] and Hµ, the model checking problem is in general undecidable; for Hµ, two fragments, the k-synchronous and k-context bounded fragments, have been identified for which model checking remains decidable [11]. Other logical extensions of HyperLTL with asynchronous capabilities are studied in [3], including their decidable fragments, but their model checking problems have not been implemented and the relative expressive power with respect to other asynchronous formalisms has not been studied.

## 2 Extended Asynchronous HyperLTL

Preliminaries. Given a natural number k ∈ N_0, we use [k] for the set {0, . . . , k}. Let AP be a set of atomic propositions and Σ = 2^AP be the alphabet, where we call each element of Σ a letter. A trace is an infinite sequence σ = a_0 a_1 · · · of letters from Σ. We denote the set of all infinite traces by Σ^ω. We use σ(i) for a_i and σ^i for the suffix a_i a_{i+1} · · · . A pointed trace is a pair (σ, p), where p ∈ N_0 is a natural number (called the pointer). Pointed traces allow to traverse a trace by moving the pointer. Given a pointed trace (σ, p) and n > 0, we use (σ, p) + n to denote the resulting trace (σ, p + n). We denote the set of all pointed traces by PTR = {(σ, p) | σ ∈ Σ^ω and p ∈ N_0}.

A Kripke structure is a tuple K = ⟨S, s_init, δ, L⟩, where S is a set of states, s_init ∈ S is the initial state, δ ⊆ S × S is a transition relation, and L : S → Σ is a labeling function on the states of K. We require that for each s ∈ S, there exists s′ ∈ S such that (s, s′) ∈ δ.

A path of a Kripke structure K is an infinite sequence of states s(0)s(1) · · · ∈ S^ω, such that s(0) = s_init and (s(i), s(i + 1)) ∈ δ, for all i ≥ 0. A trace of K is a sequence σ(0)σ(1)σ(2) · · · ∈ Σ^ω, such that there exists a path s(0)s(1) · · · ∈ S^ω with σ(i) = L(s(i)) for all i ≥ 0. We denote by Traces(K, s) the set of all traces of K with paths that start in state s ∈ S.

The directed graph F = ⟨S, δ⟩ is called the Kripke frame of the Kripke structure K. A loop in F is a finite sequence s_0 s_1 · · · s_n, such that (s_i, s_{i+1}) ∈ δ for all 0 ≤ i < n, and (s_n, s_0) ∈ δ. We call a Kripke frame acyclic if the only loops are self-loops on terminal states, i.e., on states that have no other outgoing transition. Acyclic Kripke structures model terminating programs.
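This notion of acyclicity is easy to check: remove the permitted self-loops on terminal states and verify that what remains has no cycle. A sketch, with δ represented as a set of pairs (an illustrative encoding, not from the paper):

```python
# Sketch: a Kripke frame <S, delta> is acyclic in the above sense iff, after
# removing self-loops on terminal states (states whose only successor is
# themselves), the remaining graph contains no cycle. Checked by DFS coloring.

def is_acyclic_frame(states, delta):
    succ = {s: {t for (u, t) in delta if u == s} for s in states}
    terminal = {s for s in states if succ[s] == {s}}
    succ = {s: succ[s] - ({s} if s in terminal else set()) for s in states}

    WHITE, GRAY, BLACK = 0, 1, 2
    color = {s: WHITE for s in states}

    def dfs(u):
        color[u] = GRAY
        for v in succ[u]:
            if color[v] == GRAY or (color[v] == WHITE and not dfs(v)):
                return False           # a forbidden loop was found
        color[u] = BLACK
        return True

    return all(color[s] != WHITE or dfs(s) for s in states)
```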

Extended A-HLTL. The syntax of extended A-HLTL is:

$$\begin{aligned} \varphi &::= \exists \pi.\varphi \mid \forall \pi.\varphi \mid \mathsf{E}\tau.\varphi \mid \mathsf{A}\tau.\varphi \mid \psi\\ \psi &::= true \mid a\_{\pi,\tau} \mid \neg\psi \mid \psi\_1 \lor \psi\_2 \mid \psi\_1 \land \psi\_2 \mid \psi\_1 \; \mathcal{U}\,\psi\_2 \mid \psi\_1 \; \mathcal{R}\,\psi\_2 \end{aligned}$$

where a ∈ AP, π is a trace variable from an infinite supply V of trace variables, and τ is a trajectory variable from an infinite supply J of trajectory variables (see formula φNInd in Section 1 for an example). The intended meaning of a_{π,τ} is that proposition a ∈ AP holds at the current time in trace π and trajectory τ (explained later). Trace (respectively, trajectory) quantifiers ∃π and ∀π (respectively, Eτ and Aτ) allow reasoning simultaneously about different traces (respectively, trajectories). The intended meaning of E is that there is a trajectory that gives an interpretation of the relative passage of time between the traces for which the temporal formula that relates the traces is satisfied. Dually, A means that all trajectories satisfy the inner formula. Given an A-HLTL formula φ, we use Paths(φ) (respectively, Trajs(φ)) for the set of trace (respectively, trajectory) variables quantified in φ. A formula φ is well-formed if for all atoms a_{π,τ} in φ, π and τ are quantified in φ (i.e., τ ∈ Trajs(φ) and π ∈ Paths(φ)) and no trajectory/trace variable is quantified twice in φ. We use the usual syntactic sugar false ≜ ¬true, ◇φ ≜ true U φ, φ1 → φ2 ≜ ¬φ1 ∨ φ2, and □φ ≜ ¬◇¬φ, etc. We choose to add R (release) and ∧ to the logic to enable negation normal form (NNF). As our BMC algorithm cannot handle formulas that are not invariant under stuttering, the next operator ◯ is not included.
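The reason R and ∧ enable NNF is the pair of dualities ¬(ψ1 U ψ2) ≡ ¬ψ1 R ¬ψ2 and ¬(ψ1 R ψ2) ≡ ¬ψ1 U ¬ψ2, which let negations be pushed down to the atoms. A sketch over an illustrative tuple representation of formulas (not the paper's data structures):

```python
# Sketch: negation normal form for the quantifier-free fragment above.
# Formulas are strings (atoms) or tuples: ('not', f), ('and'|'or'|'U'|'R', f, g).

DUAL = {'and': 'or', 'or': 'and', 'U': 'R', 'R': 'U'}

def nnf(f):
    if isinstance(f, str):
        return f
    if f[0] == 'not':
        g = f[1]
        if isinstance(g, str):
            return f                           # negated atom: already a literal
        if g[0] == 'not':
            return nnf(g[1])                   # eliminate double negation
        # push the negation inward using the dual connective
        return (DUAL[g[0]], nnf(('not', g[1])), nnf(('not', g[2])))
    return (f[0], nnf(f[1]), nnf(f[2]))

nnf(('not', ('U', 'a', 'b')))   # -> ('R', ('not', 'a'), ('not', 'b'))
```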

Semantics. A trajectory t : t(0)t(1)t(2) · · · for a formula φ is an infinite sequence of subsets of Paths(φ), i.e., t(i) ⊆ Paths(φ) for all i ≥ 0. Essentially, in each step of the trajectory one or more of the traces make progress, or all may stutter. A trajectory is fair for a trace variable π ∈ Paths(φ) if there are infinitely many positions j such that π ∈ t(j). A trajectory is fair if it is fair for all trace variables in Paths(φ). Given a trajectory t, by t^i we mean the suffix t(i)t(i + 1) · · · . Furthermore, for a set of trace variables V, we use TRJ_V for the set of all fair trajectories for indices from V. We also use a trajectory assignment Γ : Trajs(φ) ⇀ TRJ_{Dom(Γ)}, where Dom(Γ) is the subset of Trajs(φ) for which Γ is defined. Given a trajectory assignment Γ, a trajectory variable τ, and a trajectory t, we denote by Γ[τ ↦ t] the assignment that coincides with Γ for every trajectory variable except for τ, which is mapped to t.
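Fairness is a property of an infinite object, but for ultimately periodic trajectories (a finite prefix followed by a repeated loop, the shape that bounded unrollings produce) it can be decided by inspecting the loop alone. A sketch with trajectory steps represented as sets of trace variables (an illustrative encoding, not from the paper):

```python
# Sketch: t = prefix . loop^omega is fair for a trace variable iff that
# variable occurs in some step of the loop (it then occurs infinitely often);
# occurrences in the finite prefix do not matter.

def is_fair(prefix, loop, trace_vars):
    return all(any(pi in step for step in loop) for pi in trace_vars)

# 'q' moves only in the prefix, then stutters forever: the trajectory is
# fair for 'p' but not fair overall
fair = is_fair([{'p', 'q'}], [{'p'}], {'p', 'q'})   # False
```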

Fig. 4: Kripke structure K and traces t1 and t2 of K; K ⊨ φNInd but K ⊭ φNI.

For the semantics of extended A-HLTL, we need asynchronous trace assignments Π : Paths(φ) × Trajs(φ) → T × N, which map each pair (π, τ) formed by a trace variable and a trajectory variable into a pointed trace. Given (Π, Γ), where Π is an asynchronous trace assignment and Γ a trajectory assignment, we use (Π, Γ) + 1 for the successor of (Π, Γ), defined as (Π′, Γ′) where Γ′(τ) = Γ(τ)^1, and Π′(π, τ) = Π(π, τ) + 1 if π ∈ Γ(τ)(0) and Π′(π, τ) = Π(π, τ) otherwise. Note that Π can assign the same π to different pointed traces depending on the trajectory. We use (Π, Γ) + k for the k-th successor of (Π, Γ). Given an asynchronous trace assignment Π, a trace variable π, a trajectory variable τ, a trace σ, and a pointer p, we denote by Π[(π, τ) ↦ (σ, p)] the assignment that coincides with Π for every pair except for (π, τ), which is mapped to (σ, p). The satisfaction of an A-HLTL formula φ over a trace assignment Π, a trajectory assignment Γ, and a set of traces T is defined as follows (we omit ¬, ∧ and ∨, which are standard):


We say that a set T of traces satisfies a sentence φ, denoted by T ⊨ φ, if (Π_∅, Γ_∅) ⊨_T φ. We say that a Kripke structure K satisfies an A-HLTL formula φ (and write K ⊨ φ) if and only if Traces(K, s_init) ⊨ φ. An example is illustrated in Fig. 4.

## 3 Bounded Model Checking for A-HLTL

We first introduce the bounded semantics of A-HLTL (for at most one trajectory quantifier alternation but arbitrary trace quantifiers), which will be used to generate queries to a QBF solver for solving the BMC problem. The main result of this section is Theorem 1, which provides decision procedures for model checking A-HLTL for terminating systems.

#### 3.1 Bounded Semantics of A-HLTL

The bounded semantics corresponds to the exploration of the system up to a certain bound. In our case, we consider two bounds k and m (with k ≤ m). The bound k corresponds to the maximum depth of the unrolling of the Kripke structures, and m is the bound on the length of the trajectories. We start by introducing some auxiliary functions and predicates for a given pair (Π, Γ) of trace assignment and trajectory assignment. First, the family of functions pos_{π,τ} : {0 . . . m} → N, where pos_{π,τ}(i) gives how many times π has been selected in Γ(τ)(0), . . . , Γ(τ)(i). We assume that Kripke structures are equipped with an atomic proposition halt (one per trace variable π) which encodes whether the state is a halting state. Given (Π, Γ), we consider the predicate halted that holds whenever for all π and τ, halt ∈ σ(j) for (σ, j) = Π(π, τ). In this case we write (Π, Γ, n) ⊨ halted.
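Under an illustrative representation of a trajectory prefix as a list of sets of trace variables, pos is just a counting function:

```python
# Sketch: pos_{pi,tau}(i) counts how many of the trajectory steps 0..i
# selected the trace variable pi (i.e. let the trace bound to pi advance).

def pos(trajectory, pi, i):
    return sum(1 for j in range(i + 1) if pi in trajectory[j])

t = [{'pi'}, set(), {'pi', "pi'"}]     # pi moves at steps 0 and 2
# pos(t, 'pi', 2) == 2, pos(t, "pi'", 2) == 1
```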

We define two bounded semantics which differ only in how they inspect beyond the (k, m) bounds: ⊨^{hpes}_{k,m}, called the halting pessimistic semantics, and ⊨^{hopt}_{k,m}, called the halting optimistic semantics. We start by defining the bounded semantics of the quantifiers.


$$(\Pi, \Gamma, 0) \models_{k,m} \mathsf{E}\tau.\,\psi \quad \text{iff} \quad \text{there exists } t \in \mathsf{TRJ}_{Dom(\Pi)}: (\Pi, \Gamma[\tau \mapsto t], 0) \models_{k,m} \psi \tag{3}$$

$$(\Pi, \Gamma, 0) \models_{k,m} \mathsf{A}\tau.\,\psi \quad \text{iff} \quad \text{for all } t \in \mathsf{TRJ}_{Dom(\Pi)}: (\Pi, \Gamma[\tau \mapsto t], 0) \models_{k,m} \psi \tag{4}$$

For the Boolean operators, for i ≤ m:

$$(\Pi, \Gamma, i) \models_{k,m} \mathit{true} \tag{5}$$

$$(\Pi, \Gamma, i) \models_{k,m} a_{\pi,\tau} \quad \text{iff } a \in \sigma(j) \text{ where } (\sigma, j) = \Pi(\pi, \tau)(i) \text{ and } j \le k \tag{6}$$

$$(\Pi, \Gamma, i) \models_{k,m} \neg a_{\pi,\tau} \quad \text{iff } a \notin \sigma(j) \text{ where } (\sigma, j) = \Pi(\pi, \tau)(i) \text{ and } j \le k \tag{7}$$

$$(\Pi, \Gamma, i) \models_{k,m} \psi_1 \lor \psi_2 \quad \text{iff } (\Pi, \Gamma, i) \models_{k,m} \psi_1 \text{ or } (\Pi, \Gamma, i) \models_{k,m} \psi_2 \tag{8}$$

$$(\Pi, \Gamma, i) \models_{k,m} \psi_1 \land \psi_2 \quad \text{iff } (\Pi, \Gamma, i) \models_{k,m} \psi_1 \text{ and } (\Pi, \Gamma, i) \models_{k,m} \psi_2 \tag{9}$$

For the temporal operators, we must consider the cases of falling off the paths (beyond k) and falling off the traces (beyond m). We define the predicate off, which holds for (Π, Γ, i) if for some (π, τ), pos_{π,τ}(i) > k and halt_π ∉ σ(k), where σ is the trace assigned to π. Note that halted implies that off does not hold, because all paths (including those at k or beyond) satisfy halt.

We define two semantics that differ in how they interpret the point where the end of the unfolding of the traces and trajectories is reached. The halting pessimistic semantics, denoted by ⊨^{hpes}_{k,m}, takes (1)-(9) above and adds (10)-(13), together with the requirement that (Π, Γ, i) ⊭_{k,m} off. Rules (10) and (11) define the semantics of the temporal operators for the case i < m, that is, before the end of the unrolling of the trajectories (recall that we do not consider ◯):

$$(\Pi, \Gamma, i) \models_{k,m} \psi_1\,\mathcal{U}\,\psi_2 \quad \text{iff } (\Pi, \Gamma, i) \models_{k,m} \psi_2 \text{, or } (\Pi, \Gamma, i) \models_{k,m} \psi_1 \text{ and } (\Pi, \Gamma, i)+1 \models_{k,m} \psi_1\,\mathcal{U}\,\psi_2 \tag{10}$$

$$(\Pi, \Gamma, i) \models_{k,m} \psi_1\,\mathcal{R}\,\psi_2 \quad \text{iff } (\Pi, \Gamma, i) \models_{k,m} \psi_2 \text{, and } (\Pi, \Gamma, i) \models_{k,m} \psi_1 \text{ or } (\Pi, \Gamma, i)+1 \models_{k,m} \psi_1\,\mathcal{R}\,\psi_2 \tag{11}$$

For the case of i = m, that is, at the bound of the trajectory:

$$(\Pi, \Gamma, m) \models^{hpes}_{k,m} \psi_1\,\mathcal{U}\,\psi_2 \quad \text{iff } (\Pi, \Gamma, m) \models_{k,m} \psi_2 \tag{12}$$

$$(\Pi, \Gamma, m) \models^{hpes}_{k,m} \psi_1\,\mathcal{R}\,\psi_2 \quad \text{iff } (\Pi, \Gamma, m) \models_{k,m} \psi_1 \land \psi_2 \text{, or } (\Pi, \Gamma, m) \models_{k,m} \mathit{halted} \land \psi_2 \tag{13}$$
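Rules (10) and (12) together give a simple recursive evaluation of U under the pessimistic semantics; a sketch (sat1 and sat2 are stand-in oracles for the satisfaction of ψ1 and ψ2 at a given position):

```python
# Sketch: pessimistic bounded evaluation of psi1 U psi2. Before the bound m
# we unfold one step (rule (10)); at the bound we require psi2 to hold now
# (rule (12)), i.e. pending eventualities are assumed to remain unfulfilled.

def until_pes(sat1, sat2, i, m):
    if i == m:
        return sat2(m)                 # rule (12): no optimism beyond the bound
    return sat2(i) or (sat1(i) and until_pes(sat1, sat2, i + 1, m))  # rule (10)

# eventuality fulfilled only at position 5: rejected for m = 4, accepted for m = 5
# until_pes(lambda i: True, lambda i: i == 5, 0, 4) -> False
# until_pes(lambda i: True, lambda i: i == 5, 0, 5) -> True
```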

The halting optimistic semantics, denoted by ⊨^{hopt}_{k,m}, takes rules (1)-(11) and (12′)-(13′), but now if (Π, Γ, i) ⊨^{hopt}_{k,m} off then (Π, Γ, i) ⊨^{hopt}_{k,m} φ holds for every formula φ. Again, rules (10) and (11) define the semantics of the temporal operators for the case i < m. Then, for i = m:

$$(\Pi, \Gamma, m) \models^{hopt}_{k,m} \psi_1\,\mathcal{U}\,\psi_2 \quad \text{iff } (\Pi, \Gamma, m) \models_{k,m} \psi_2 \text{, or } (\Pi, \Gamma, m) \models_{k,m} \neg\mathit{halted} \land \psi_1 \tag{12'}$$

$$(\Pi, \Gamma, m) \models^{hopt}_{k,m} \psi_1\,\mathcal{R}\,\psi_2 \quad \text{iff } (\Pi, \Gamma, m) \models_{k,m} \psi_2 \tag{13'}$$

Similar to [15] for the case of HyperLTL, the pessimistic semantics captures the case where we assume that pending eventualities will not become true after the end of the trace (this is also assumed in LTL BMC). Dually, the optimistic semantics assumes that all pending eventualities at the end of the trace will be fulfilled. Therefore, the following hold (proofs in [13]).

Lemma 1. Let k ≤ k′ and m ≤ m′.

1. If (Π, Γ, 0) ⊨^{hpes}_{k,m} φ, then (Π, Γ, 0) ⊨^{hpes}_{k′,m′} φ.
2. If (Π, Γ, 0) ⊭^{hopt}_{k,m} φ, then (Π, Γ, 0) ⊭^{hopt}_{k′,m′} φ.

Lemma 2. The following hold for every k and m:

1. If (Π, Γ, 0) ⊨^{hpes}_{k,m} φ, then (Π, Γ, 0) ⊨ φ.
2. If (Π, Γ, 0) ⊭^{hopt}_{k,m} φ, then (Π, Γ, 0) ⊭ φ.

#### 3.2 From Bounded Semantics to QBF Solving

Let K be a Kripke structure and φ an A-HLTL formula. Based on the bounded semantics introduced previously, our main approach is to generate a QBF query (with bounds k, m), which can use either the pessimistic or the optimistic semantics. We use ⟦K, φ⟧^{hpes}_{k,m} if the pessimistic semantics are used and ⟦K, φ⟧^{hopt}_{k,m} if the optimistic semantics are used. Our translations will satisfy that


The first step to define ⟦K, φ⟧^{hopt}_{k,m} and ⟦K, φ⟧^{hpes}_{k,m} is to encode the unrolling of the models up to a given depth k. For a trace variable π corresponding to Kripke structure K, we introduce (k + 1) copies (x^0, . . . , x^k) of the Boolean variables that define the state of K and use the initial condition I and the transition relation R of K to relate these variables. For example, for k = 3, we unroll the transition relation up to depth 3 as follows:

$$\llbracket K \rrbracket_3 = I(x^0) \wedge R(x^0, x^1) \wedge R(x^1, x^2) \wedge R(x^2, x^3).$$
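As a sanity check on this construction, the unrolling is just one initial-condition conjunct followed by k transition conjuncts. The following Python sketch evaluates it on a concrete candidate path (an illustrative simplification: here I and R are executable predicates over concrete states rather than symbolic Boolean formulas):

```python
def unrolled(I, R, path):
    # Evaluate [K]_k = I(x^0) ∧ R(x^0,x^1) ∧ ... ∧ R(x^{k-1},x^k) on a
    # concrete candidate path x^0..x^k. Sketch only: in the actual encoding,
    # I and R are symbolic constraints over the (k+1) copies of the state
    # variables, not Python predicates.
    return I(path[0]) and all(R(a, b) for a, b in zip(path, path[1:]))

# A toy Kripke structure: a mod-4 counter that starts at 0.
I = lambda s: s == 0
R = lambda s, t: t == (s + 1) % 4
```

Replacing the predicates by symbolic constraints over the $(k+1)$ copies of the state variables yields exactly $\llbracket K \rrbracket_k$.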

Encoding positions. For each trajectory variable τ and given the bound m on the unrolling of trajectories, we add $|Paths(\varphi)| \cdot (m+1)$ variables $t^0_\pi, \ldots, t^m_\pi$, one family for each π. The intended meaning of $t^j_\pi$ is that $t^j_\pi$ is true whenever $\pi \in \tau(j)$, that is, when τ dictates that π moves at time instant j. In order to encode sanity conditions on trajectories, which are crucial for completeness, it is necessary to introduce a family of variables that captures how much π has moved according to τ after j steps. There is a variable pos for each trace variable π, each trajectory τ, and each i ≤ k and j ≤ m. We represent this variable by $pos^{i,j}_{\pi,\tau}$. The intention is that $pos^{i,j}_{\pi,\tau}$ is true whenever after j steps trajectory τ has dictated that trace π progresses precisely i times. Fig. 5 shows the encodings $t^j_\pi$ and $pos^{i,j}_{\pi,\tau}$ for the traces w.r.t. the blue trajectory τ′ in Fig. 4. We will use the auxiliary


Fig. 5: Variables for encodings of the blue trajectory in Fig. 4, where green variables are true and gray variables are false.

definitions (for i ∈ {0 . . . k} and j ∈ {0 . . . m}) to force that the path π has moved to position i after j moves of the trajectory and that π has not fallen off the trace (and does not change position once it has fallen off the trace):

$$setpos^{i,j}_{\pi,\tau} \stackrel{\text{def}}{=} pos^{i,j}_{\pi,\tau} \wedge \bigwedge_{n \in \{0..k\} \setminus \{i\}} \neg pos^{n,j}_{\pi,\tau} \wedge \neg off^{j}_{\pi,\tau}$$

$$npos^{j}_{\pi,\tau} \stackrel{\text{def}}{=} off^{j}_{\pi,\tau} \wedge \bigwedge_{n \in \{0..k\}} \neg pos^{n,j}_{\pi,\tau}$$

Initially, $I_{pos} \stackrel{\text{def}}{=} \bigwedge_{\pi,\tau} setpos^{0,0}_{\pi,\tau}$, where π ∈ Traces(φ) and τ ∈ TRJDom(Π). $I_{pos}$ captures that all paths are initially at position 0. Then, for every step j ∈ {0 . . . m}, the following formulas relate the values of pos and off, depending on whether trajectory τ moves path π or not (and on whether π has reached the end k or halted):

$$step^j\_{\pi,\tau} \stackrel{\text{def}}{=} \bigwedge\_{i \in \{0..k-1\}} \left( pos^{i,j}\_{\pi,\tau} \wedge t^j\_{\pi} \to setpos^{i+1,j+1}\_{\pi,\tau} \right)$$

$$stutter^{j}_{\pi,\tau} \stackrel{\text{def}}{=} \bigwedge_{i \in \{0..k\}} \left( pos^{i,j}_{\pi,\tau} \wedge \neg t^{j}_{\pi} \to setpos^{i,j+1}_{\pi,\tau} \right)$$

$$ends^{j}_{\pi,\tau} \stackrel{\text{def}}{=} \left( pos^{k,j}_{\pi,\tau} \wedge t^{j}_{\pi} \right) \to \left( \left( \neg halt^{k}_{\pi} \to npos^{j+1}_{\pi,\tau} \right) \wedge \left( halt^{k}_{\pi} \to setpos^{k,j+1}_{\pi,\tau} \right) \right)$$

Then the following formula captures the correct assignment to the pos variables, including the initial assignment:

$$\varphi_{pos} \stackrel{\text{def}}{=} I_{pos} \wedge \bigwedge_{j \in \{0..m\}} \bigwedge_{\pi,\tau} \left( step^{j}_{\pi,\tau} \wedge stutter^{j}_{\pi,\tau} \wedge ends^{j}_{\pi,\tau} \right)$$

For example, Fig. 5 (w.r.t. Fig. 4) encodes the blue trajectory τ′ of π (i.e., t1) and π′ (i.e., t2) as follows. First, for j ∈ [0, 3), it advances t1 and stutters t2. Therefore, $t^0_\pi, t^1_\pi, t^2_\pi$ are true and $t^0_{\pi'}, t^1_{\pi'}, t^2_{\pi'}$ are false. Notice that for the pos encodings, the position of π advances according to $step^j_{\pi,\tau'}$ (i.e., $pos^{0,0}_{\pi,\tau'}, pos^{1,1}_{\pi,\tau'}, pos^{2,2}_{\pi,\tau'}, pos^{3,3}_{\pi,\tau'}$), while π′ stutters according to $stutter^j_{\pi',\tau'}$ (i.e., $pos^{0,0}_{\pi',\tau'}, pos^{0,1}_{\pi',\tau'}, pos^{0,2}_{\pi',\tau'}, pos^{0,3}_{\pi',\tau'}$). Then, for j ∈ [3, 5], it alternately advances t2, which makes $t^3_\pi, t^4_\pi, t^5_\pi$ false and $t^3_{\pi'}, t^4_{\pi'}, t^5_{\pi'}$ true. Accordingly, the positions become $pos^{3,4}_{\pi,\tau'}, pos^{3,5}_{\pi,\tau'}, pos^{3,6}_{\pi,\tau'}$ and $pos^{1,4}_{\pi',\tau'}, pos^{2,5}_{\pi',\tau'}, pos^{3,6}_{\pi',\tau'}$. At the halting point, both trajectories trigger $ends^j$ and do not advance anymore.
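The interplay of the step, stutter, and ends rules can be replayed on this example with a small Python sketch (a semantic simulation of the constraints, not the QBF encoding itself; the function name and the `halt_at_k` flag, which abstracts $halt^k_\pi$, are ours):

```python
def positions(k, moves, halt_at_k=True):
    # Trace positions along a trajectory, mirroring step/stutter/ends:
    # a move advances the position; moving while at position k keeps the
    # position if the trace has halted (setpos) and otherwise falls off
    # the unrolling (npos, modelled here as None).
    pos = [0]
    for mv in moves:
        p = pos[-1]
        if p is None:               # npos: once off, stays off
            pos.append(None)
        elif not mv:                # stutter
            pos.append(p)
        elif p < k:                 # step
            pos.append(p + 1)
        else:                       # ends: asked to move at position k
            pos.append(k if halt_at_k else None)
    return pos

# The blue trajectory τ' of Fig. 5 (k = 3, m = 6):
positions(3, [True, True, True, False, False, False])   # π:  [0, 1, 2, 3, 3, 3, 3]
positions(3, [False, False, False, True, True, True])   # π': [0, 0, 0, 0, 1, 2, 3]
```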

Encoding the inner LTL formula. We will use the following auxiliary predicates:

$$halted^{j} \stackrel{\text{def}}{=} \bigwedge_{\tau} halted^{j}_{\tau} \qquad\qquad off^{j} \stackrel{\text{def}}{=} \bigvee_{\pi,\tau} off^{j}_{\pi,\tau}$$

We now give the encoding of the inner temporal formulas for a fixed unrolling k and m as follows. For the atomic and Boolean formulas, the following translations are performed for j ∈ {0 . . . m}.

$$\|p_{\pi,\tau}\|^{j}_{k,m} := \bigvee_{i \in \{0..k\}} \left( pos^{i,j}_{\pi,\tau} \wedge p^{i}_{\pi} \right) \tag{14}$$

$$\|\neg p_{\pi,\tau}\|^{j}_{k,m} := \bigvee_{i \in \{0..k\}} \left( pos^{i,j}_{\pi,\tau} \wedge \neg p^{i}_{\pi} \right) \tag{15}$$

$$\|\psi_1 \vee \psi_2\|^{j}_{k,m} := \|\psi_1\|^{j}_{k,m} \vee \|\psi_2\|^{j}_{k,m} \tag{16}$$

$$\|\psi\_1 \wedge \psi\_2\|\_{k,m}^j := \|\psi\_1\|\_{k,m}^j \wedge \|\psi\_2\|\_{k,m}^j \tag{17}$$

The halting pessimistic semantics translation uses $\llbracket \cdot \rrbracket^{hpes}$, taking (14)-(17) and (18)-(21) below. For the temporal operators and j < m:

$$\|\psi_1\ \mathcal{U}\ \psi_2\|^{j}_{k,m} := \neg off^{j} \wedge \left( \|\psi_2\|^{j}_{k,m} \vee \left( \|\psi_1\|^{j}_{k,m} \wedge \|\psi_1\ \mathcal{U}\ \psi_2\|^{j+1}_{k,m} \right) \right) \tag{18}$$

$$\|\psi_1\ \mathcal{R}\ \psi_2\|^{j}_{k,m} := \neg off^{j} \wedge \left( \|\psi_2\|^{j}_{k,m} \wedge \left( \|\psi_1\|^{j}_{k,m} \vee \|\psi_1\ \mathcal{R}\ \psi_2\|^{j+1}_{k,m} \right) \right) \tag{19}$$

For j = m:

$$\|\psi_1\ \mathcal{U}\ \psi_2\|^{m}_{k,m} := \|\psi_2\|^{m}_{k,m} \tag{20}$$

$$\|\psi_1\ \mathcal{R}\ \psi_2\|^{m}_{k,m} := \left( \|\psi_1\|^{m}_{k,m} \wedge \|\psi_2\|^{m}_{k,m} \right) \vee \left( halted^{m} \wedge \|\psi_2\|^{m}_{k,m} \right) \tag{21}$$
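Read operationally, the pessimistic translation of $\mathcal{U}$ is a simple recursion over j. A Python sketch, assuming the subformula values $\|\psi_1\|^j$, $\|\psi_2\|^j$ and the $off^j$ flags are precomputed Booleans (the real encoding emits (18) and (20) as QBF constraints rather than evaluating them):

```python
def until_pes(w1, w2, off, j=0):
    # w1[j] = ||psi1||^j, w2[j] = ||psi2||^j, off[j] = fallen-off flag,
    # for j = 0..m. Base case is eq. (20); the recursive case is eq. (18).
    m = len(w1) - 1
    if j == m:
        return w2[m]
    return (not off[j]) and (w2[j] or (w1[j] and until_pes(w1, w2, off, j + 1)))
```

The release operator unfolds analogously from (19) and (21).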

The halting optimistic semantics translation uses $\llbracket \cdot \rrbracket^{hopt}$, taking (14)-(17) and (18′)-(21′) as follows. For the temporal operators and j < m:

$$\|\psi_1\ \mathcal{U}\ \psi_2\|^{j}_{k,m} := off^{j} \vee \left( \|\psi_2\|^{j}_{k,m} \vee \left( \|\psi_1\|^{j}_{k,m} \wedge \|\psi_1\ \mathcal{U}\ \psi_2\|^{j+1}_{k,m} \right) \right) \tag{18'}$$

$$\|\psi_1\ \mathcal{R}\ \psi_2\|^{j}_{k,m} := off^{j} \vee \left( \|\psi_2\|^{j}_{k,m} \wedge \left( \|\psi_1\|^{j}_{k,m} \vee \|\psi_1\ \mathcal{R}\ \psi_2\|^{j+1}_{k,m} \right) \right) \tag{19'}$$

For j = m:

$$\|\psi_1\ \mathcal{U}\ \psi_2\|^{m}_{k,m} := \|\psi_2\|^{m}_{k,m} \vee \left( halted^{m} \wedge \|\psi_1\|^{m}_{k,m} \right) \tag{20'}$$

$$\|\psi_1\ \mathcal{R}\ \psi_2\|^{m}_{k,m} := \|\psi_2\|^{m}_{k,m} \tag{21'}$$
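The optimistic translation admits the same recursive reading; a Python sketch, with w1[j], w2[j], off[j] the precomputed values of $\|\psi_1\|^j$, $\|\psi_2\|^j$, $off^j$, and halted_m the value of $halted^m$ (again, the real encoding emits (18') and (20') as QBF constraints):

```python
def until_opt(w1, w2, off, halted_m, j=0):
    # Base case is eq. (20'); the recursive case is eq. (18').
    m = len(w1) - 1
    if j == m:
        return w2[m] or (halted_m and w1[m])
    return off[j] or w2[j] or (w1[j] and until_opt(w1, w2, off, halted_m, j + 1))
```

With all eventualities pending at the bound (w2 everywhere false), this returns true precisely when the traces have halted and ψ1 still holds, matching the optimistic reading of pending eventualities.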

Combining the encodings. Let φ be an A-HLTL formula of the form $\varphi = \mathsf{Q}_A\pi_A.\ \ldots\ \mathsf{Q}_Z\pi_Z.\ \mathsf{Q}_a\tau_a.\ \ldots\ \mathsf{Q}_z\tau_z.\ \psi$. Combining all the components, the encoding of the A-HLTL BMC problem into QBF, for bounds k and m, is:

$$\llbracket K, \varphi \rrbracket_{k,m} = \mathsf{Q}_A \overline{x_A}.\; \cdots\; \mathsf{Q}_Z \overline{x_Z}.\; \mathsf{Q}_a \overline{t_a}.\; \cdots\; \mathsf{Q}_z \overline{t_z}.\; \exists \overline{pos}.\; \exists \overline{off}.\; \Big( \llbracket K \rrbracket_k \circ_A \cdots \llbracket K \rrbracket_k \circ_Z \big( \varphi_{pos} \wedge enc(\psi) \big) \Big)$$

where $\circ_A = \to$ if $\mathsf{Q}_A = \forall$ (and $\circ_A = \wedge$ if $\mathsf{Q}_A = \exists$), and $\circ_B, \ldots$ are defined similarly. The set $\overline{pos}$ consists of the variables $pos^{i,j}_{\pi,\tau}$ that encode the positions, and $\overline{off}$ consists of the variables $off^{j}_{\pi,\tau}$ that encode when a trace's progress has fallen off its unrolling limit. We next define the encoding enc(ψ) of the temporal formula ψ.

Encoding formulas with up to one trajectory quantifier alternation. We consider the encoding into QBF of formulas with zero and one quantifier alternation separately. In the following, we say that at position j a collection of trajectories U "moves" whenever either all trajectories have moved all their paths to the halting state, or at least one of the trajectories in U makes one of the non-halted paths move at position j. Formally,

$$moves^{j}_{U} \stackrel{\text{def}}{=} halted^{j} \vee \bigvee_{\tau \in U,\ \pi} \left( t^{j}_{\pi} \wedge \neg halted^{j}_{\pi} \right)$$

– $\mathsf{E}^{+}U.\psi$: In this case, the formula generated for enc(ψ) is

$$\Big(\bigwedge_{j \in \{0..m\}} moves^{j}_{U}\Big) \wedge \|\psi\|^{0}_{k,m}$$

This is correct since the positions at which all trajectories stutter all paths can be removed (still obtaining a satisfying trajectory assignment), so we can restrict the search to non-stuttering trajectory steps.

– $\mathsf{A}^{+}U.\psi$: In this case, the formula generated for enc(ψ) is

$$\Big(\bigwedge_{j \in \{0..m\}} moves^{j}_{U}\Big) \to \|\psi\|^{0}_{k,m}$$

The reasoning is similar to the previous case.

– $\mathsf{A}^{+}U_A\,\mathsf{E}^{+}U_E.\psi$: In this case, the formula generated for enc(ψ) is

$$\Big(\bigwedge_{j \in \{0..m\}} moves^{j}_{U_A}\Big) \to \Big(\bigwedge_{j \in \{0..m\}} \big(halted^{j}_{U_A} \to moves^{j}_{U_E}\big) \wedge \|\psi\|^{0}_{k,m}\Big)$$

The universally quantified trajectories must explore all trajectories, which must be answered by the existential trajectories. Assume there is a strategy for $U_E$ for the case where the universal trajectories $U_A$ never stutter at any position. This can be extended into a strategy for the case where $U_A$ may stutter, by adding a stuttering step to the $U_E$ trajectories at the same position, which guarantees the same evaluation. Therefore, we restrict our search for the outer $U_A$ to non-stuttering trajectories. Finally, $U_E$ is obliged to move after $U_A$ has halted all paths, to prevent global stuttering.

– $\mathsf{E}^{+}U_E\,\mathsf{A}^{+}U_A.\psi$: In this case, the formula generated for enc(ψ) is similar:

$$\Big(\bigwedge_{j \in \{0..m\}} moves^{j}_{U_E}\Big) \wedge \Big(\bigwedge_{j \in \{0..m\}} \big(halted^{j}_{U_E} \to moves^{j}_{U_A}\big) \to \|\psi\|^{0}_{k,m}\Big)$$

The rationale for this encoding is the following. It is not necessary to explore a non-moving step j for the existentially quantified trajectories $U_E$, because if this stuttering step is successful it must work for all possible moves of the $U_A$ trajectories at the same time step j. This includes the case where all trajectories in $U_A$ make all paths stutter (and, if we remove step j, one still obtains all the legal trajectories for $U_A$). Since the logic does not contain the next operator, the evaluation for the given $U_E$ and any of the $U_A$ trajectories that stutter at j will be the same at j and j + 1 for all formulas. Therefore, the trajectory obtained by removing step j from $U_E$ is still a satisfying trajectory assignment. It follows that if there is a model for $U_E$, there is a model that does not stutter. Finally, after all paths have halted according to the $U_E$ trajectories, a step of $U_A$ that stutters all paths that have not halted can be removed because, again, the evaluation is the same in the previous and subsequent state. It follows that if the formula has a model, then it has a model satisfying the encoding.

Theorem 1. Let φ be an A-HLTL formula with at most one trajectory quantifier alternation, let K be the maximum depth of a Kripke structure, and let M = K × |Paths(φ)| × |Trajs(φ)|. Then, the following hold:


Theorem 1 (proof in [13]) provides a model checking decision procedure. An alternative decision procedure is to iteratively increase the bounds of the unrollings and invoke both semantics in parallel until the outcomes coincide.

## 4 Complexity of A-HLTL Model Checking for Acyclic Frames

Our goal in this section is to analyze the complexity of the A-HLTL model checking problem in the size of an acyclic Kripke structure (all proofs in [13]).

Problem Formulation. We use MC-Fragment to distinguish different variations of the problem, where MC is the model checking decision problem, i.e., whether or not K |= φ, and Fragment is one of the following for φ:


The Complexity of A-HLTL Model Checking. We first show that the A-HLTL model checking problem for the alternation-free fragment with only one trajectory quantifier is NL-complete. For example, verification of information leaks in speculative execution in sequential programs renders a formula of the form $\forall^4\mathsf{A}$, which belongs to the alternation-free fragment (more details in Section 5).

Theorem 2. MC-$\exists^{+}\mathsf{E}$ and MC-$\forall^{+}\mathsf{A}$ are NL-complete.

We now switch to formulas with alternating trace quantifiers. The significance of the next theorem is that a single trajectory quantifier does not change the complexity of model checking as compared to classic HyperLTL verification [2]. It is noteworthy that several important classes of formulas belong to this fragment. For example, according to Theorem 3, model checking observational determinism [20] (∀∀E), generalized noninference [16] (∀∀∃E), and non-inference [5] (∀∃E) with a single initial input are all coNP-complete.

Theorem 3. MC-$\exists(\exists/\forall)^{+}(\mathsf{A}/\mathsf{E})_k$ is $\Sigma^{p}_{k}$-complete and MC-$\forall(\forall/\exists)^{+}(\mathsf{E}/\mathsf{A})_k$ is $\Pi^{p}_{k}$-complete in the size of the Kripke structure.

We now focus on formulas with multiple trajectory quantifiers. We first show that alternation-free multiple trajectory quantifiers bump the complexity up by one step in the polynomial hierarchy.

Theorem 4. MC-$\exists(\exists/\forall)^{+}\mathsf{E}\mathsf{E}^{+}_k$ is $\Sigma^{p}_{k+1}$-complete and MC-$\forall(\forall/\exists)^{+}\mathsf{A}\mathsf{A}^{+}_k$ is $\Pi^{p}_{k+1}$-complete in the size of the Kripke structure.

Theorem 5. For k ≥ 1, MC-$\exists(\exists/\forall)^{+}\mathsf{A}^{+}\mathsf{E}^{+}_k$ is $\Sigma^{p}_{k+1}$-complete and MC-$\forall(\forall/\exists)^{+}\mathsf{E}^{+}\mathsf{A}^{+}_k$ is $\Pi^{p}_{k+1}$-complete in the size of the Kripke structure.

Finally, Theorems 3, 4, and 5 imply that the model checking problem for acyclic Kripke structures and A-HLTL formulas with an arbitrary number of trace quantifier alternations and only one trajectory quantifier is in PSPACE.

## 5 Case Studies and Evaluation

We now evaluate our technique. The encoding in Section 3 is implemented on top of the open-source bounded model checker HyperQB [15]. All experiments are executed on a MacBook Pro with a 2.2 GHz processor and 16 GB RAM (https://github.com/TART-MSU/async_hltl_tacas23).

Non-interference in Concurrent Programs. We first consider the programs presented earlier in Figs. 1 and 3 together with the A-HLTL formulas $\varphi_{NI}$ and $\varphi_{NI_{nd}}$ from Section 1. We receive UNSAT (for the original formula and not its negation), which indicates that violations have been spotted. Indeed, our implementation successfully finds a counterexample with a specific trajectory that prints out 'acdb' when the high-security value h is equal to zero (entries ACDB and ACDB$_{ndet}$ in Table 3). Our other experiment is an extension of the example in [10] to multiple asynchronous channels (see Fig. 6) and the following formula: $\varphi_{OD_{nd}} = \forall\pi.\forall\pi'.\mathsf{A}\tau.\mathsf{E}\tau'.\ \square(l_{\pi,\tau} \leftrightarrow l_{\pi',\tau}) \to \square(obs_{\pi,\tau'} \leftrightarrow obs_{\pi',\tau'})$. The results for this case are the entries ConcLeak and ConcLeak$_{ndet}$ in Table 3. Details of the counterexample can be found in [13].

Fig. 6: Program with nondeterministic sequence of inputs.

Speculative Information Flow. Speculative execution is a standard optimization technique that allows branch prediction by the processor. Speculative non-interference (SNI) [9] requires that two executions with the same policy p (i.e., initial configuration) can be observed differently in speculative semantics (e.g., a possible branch) if and only if their non-speculative semantics with normal condition checks are also observed differently; i.e., the following A-HLTL formula:

$$\varphi_{\mathsf{SNI}} = \underbrace{\forall \pi_1. \forall \pi_2.}_{\text{speculative}}\ \underbrace{\forall \pi'_1. \forall \pi'_2.}_{\text{non-speculative}}\ \mathsf{A}\tau.\ \Big( \square(\mathsf{obs}_{\pi_1,\tau} \leftrightarrow \mathsf{obs}_{\pi_2,\tau}) \wedge (\mathsf{p}_{\pi_1,\tau} \leftrightarrow \mathsf{p}_{\pi_2,\tau}) \wedge (\mathsf{p}_{\pi_1,\tau} \leftrightarrow \mathsf{p}_{\pi'_1,\tau}) \wedge (\mathsf{p}_{\pi_2,\tau} \leftrightarrow \mathsf{p}_{\pi'_2,\tau}) \Big) \to \square\big(\mathsf{obs}_{\pi'_1,\tau} \leftrightarrow \mathsf{obs}_{\pi'_2,\tau}\big).$$

where obs is the memory footprint, traces $\pi_1$ and $\pi_2$ range over the (non-speculative) C code, and traces $\pi'_1$ and $\pi'_2$ range over the corresponding (speculative) assembly code. We evaluate SNI on the translation from a C program (details in [13]), where y is the input policy p, against multiple versions of x86 assembly code [9]. The results of model checking speculative execution are in Table 3 (see entries SpecExcu$_{V1}$ to SpecExcu$_{V7}$). The additional versions SpecExcu$_{V3}$ to SpecExcu$_{V7}$ are under different compilation options. Our method correctly identifies all the insecure and secure ones as stated in [9].

Compiler Optimization Security. Secure compiler optimization [17] aims at preserving the input-output behaviors of a source program (original implementation) and a target program (after applying optimization), including security policies. We investigate the following optimization strategies: Dead Branch Elimination (DBE), Loop Peeling (LP), and Expression Flattening (EF). To verify a secure optimization, we consider two scenarios: (1) a single I/O event (one trajectory, similar to [1]), and (2) a sequence of I/O events (two trajectories):

$$\begin{split} \varphi_{\mathsf{SC}} &= \forall \pi. \forall \pi'. \mathsf{E}\tau.\ (\mathsf{in}_{\pi,\tau} \leftrightarrow \mathsf{in}_{\pi',\tau}) \to \square\left(\mathsf{out}_{\pi,\tau} \leftrightarrow \mathsf{out}_{\pi',\tau}\right) \\ \varphi_{\mathsf{SC}_{sd}} &= \forall \pi. \forall \pi'. \mathsf{A}\tau.\ \mathsf{E}\tau'.\ \square\left(\mathsf{in}_{\pi,\tau} \leftrightarrow \mathsf{in}_{\pi',\tau}\right) \to \square\left(\mathsf{out}_{\pi,\tau'} \leftrightarrow \mathsf{out}_{\pi',\tau'}\right), \end{split}$$

where in is the set of inputs and out is the set of outputs. Table 3 (cases DBE – EFLP$_{ndet}$) shows the verification results for each optimization strategy and for different combinations of the strategies (details in [13]).

Cache-Based Timing Attacks. Asynchrony also leads to attacks when system executions are confined to a single CPU and its cache [18]. A cache-based timing attack happens when an attacker is able to guess the values of high-security variables because cache operations (i.e., evict, fetch) influence the scheduling of different threads. Our case study is inspired by the cache-based timing attack example in [18], and we use the formula of observational determinism $\varphi_{OD_{nd}}$ introduced earlier in this section to find the potential attacks (see the cases CacheTA and CacheTA$_{ndet}$ in Table 3, with details in [13]).

#### 5.1 Analysis of Experimental Results

Table 3 presents the diameter of the transition relation, the length of trajectories m, the state spaces, and the number of trajectory variables. We also present the total solving time of our algorithm as well as its breakdown: generating models (genQBF), building trajectory encodings (buildTr), and final QBF solving (solveQBF). Our two most complex cases are concurrent leak (ConcLeak$_{ndet}$) and loop peeling (LP$_{ndet}$). For concurrent leak, this is because there are three threads with many interleavings (i.e., asynchronous composition), which takes longer to build. For loop peeling, there is no need to consider interleavings except for the nondeterministic inputs; however, the diameters of the traces ($D_{K_1}$, $D_{K_2}$) are longer than in other cases, which makes the length and number of trajectory variables (i.e., m and |T|) grow and increases the total solving time.

Our encoding is able to handle a variety of cases with one or more trajectories, depending on whether multiple sources of nondeterminism are present. To assess efficiency, we compare the solving times for the compiler optimization cases with one trajectory against the results in [1]. This


Table 2: Comparison of model checking compiler optimization with [1].

method reduces A-HLTL model checking to HyperLTL model checking for limited fragments and utilizes the model checker MCHyper. In contrast, we directly handle asynchrony by trajectory encoding. Table 2 shows that our algorithm considerably outperforms the approach in [1] on larger cases.


Table 3: Case study breakdown for Kripke structures: $K_1$, $K_2$ (all case studies have two, e.g., one for high-level and one for assembly code), formula: φ, diameter: D, state space: |S|, trajectory depth: m, and size of trajectory variables: |T|.

## 6 Conclusion and Future Work

In this paper, we focused on the problem of A-HLTL model checking for terminating programs. We generalized A-HLTL to allow nested trajectory quantification, where a trajectory determines how different traces may advance and stutter. We rigorously analyzed the complexity of A-HLTL model checking for acyclic Kripke structures. The complexity grows in the polynomial hierarchy with the number of quantifier alternations, and it is either aligned with that of HyperLTL or one step higher in the polynomial hierarchy. We also proposed a BMC algorithm for A-HLTL based on QBF solving and reported successful experimental results on verification of information-flow security in concurrent programs, speculative execution, compiler optimization, and cache-based timing attacks.

Asynchronous hyperproperties enable logic-based verification for software programs. Thus, future work includes developing different abstraction techniques, such as predicate abstraction and abstraction-refinement, to develop software model checking techniques. We also believe that developing synthesis techniques for A-HLTL creates opportunities to automatically generate secure programs and to assist in areas such as secure compilation.

## References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## Model Checking Linear Dynamical Systems under Floating-point Rounding<sup>⋆</sup>

Engel Lefaucheux1() , Joël Ouaknine<sup>2</sup> , David Purser<sup>3</sup>,<sup>4</sup> , and Mohammadamin Sharif<sup>5</sup>

<sup>1</sup> University of Lorraine, CNRS, Inria, LORIA, Nancy, France engel.lefaucheux@inria.fr

<sup>2</sup> Max Planck Institute for Software Systems, Saarland Informatics Campus,

Saarbrücken, Germany

joel@mpi-sws.org

<sup>3</sup> University of Warsaw, Warsaw, Poland

<sup>4</sup> University of Liverpool, Liverpool, UK

D.Purser@liverpool.ac.uk

<sup>5</sup> Sharif University of Technology, Tehran, Iran sharifim689@gmail.com

Abstract. We consider linear dynamical systems under floating-point rounding. In these systems, a matrix is repeatedly applied to a vector, but the numbers are rounded into floating-point representation after each step (i.e., stored as a fixed-precision mantissa and an exponent). This approach more faithfully models realistic implementations of linear loops than the exact arbitrary-precision setting often employed in the study of linear dynamical systems.

Our results are twofold: We show that for non-negative matrices there is a special structure to the sequence of vectors generated by the system: the mantissas are periodic and the exponents grow linearly. We leverage this to show decidability of ω-regular temporal model checking against semialgebraic predicates. This contrasts with the unrounded setting, where even the non-negative case encompasses the long-standing open Skolem and Positivity problems.

On the other hand, when negative numbers are allowed in the matrix, we show that the reachability problem is undecidable by encoding a two-counter machine. Again, this is in contrast with the unrounded setting, where point-to-point reachability is known to be decidable in polynomial time.

Keywords: Model Checking · Floating-point · Dynamical Systems.

## 1 Introduction

Loops are a fundamental staple of any programming language, and the study of loops plays a pivotal role in many subfields of computer science, including automated verification, abstract interpretation, program analysis, semantics, etc. The focus of the present paper is on the algorithmic analysis of simple (i.e., non-nested) linear (or affine) while loops, such as the following:

<sup>⋆</sup> A long version of this paper is available as [19].

```
x = 3, y = 4, z = 2
while x+3y+z > 4:
    x = 3x +2z
    y = 3x + y
    z = y + z
```
We are interested in analysing how the loop evolves. A simple reachability query is to decide whether the loop variables ever satisfy a Boolean combination of polynomial inequalities, for example modelling a loop guard. More generally, one might seek to consider significantly more complex temporal properties, such as those expressible in linear temporal logic or monadic second-order logic: this gives rise to a model-checking problem.

Modelling the evolution of such a loop may require unbounded memory. That is, the number of bits needed to represent the numbers x, y, and z may grow larger and larger. However, most computer systems do not represent rational numbers to arbitrary precision, but rather use floating-point rounding, in which a number y is stored using two components: the mantissa m ∈ Q and the exponent α ∈ Z, such that y = m · 10<sup>α</sup>.<sup>6</sup>
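Simulating the loop above with exact integer arithmetic shows this growth concretely (a sketch; we assume the three assignments act simultaneously, as a single matrix application, which is the standard linear-loop reading):

```python
def step(v):
    # One simultaneous update of the loop body. Treating the assignments
    # as one matrix application is an assumption of this sketch; sequential
    # assignments would give a different, but still linear, matrix.
    x, y, z = v
    return (3 * x + 2 * z, 3 * x + y, y + z)

v = (3, 4, 2)
for _ in range(20):
    v = step(v)
bits = max(n.bit_length() for n in v)  # keeps growing under exact arithmetic
```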

Typically, floating-point numbers are specified using either 32 or 64 bits, with some of these reserved for the mantissa and some for the exponent, thus bounding both the mantissa and the exponent. We do not do this: we only place a bound on the number of bits representing the mantissa, allowing the exponent to grow unboundedly (in either direction). From a theoretical standpoint, bounding the number of bits of both the mantissa and the exponent would necessarily give rise to a finite-state system, for which essentially any decision problem would become decidable (at least in principle, if not necessarily in practice). Due to the unboundedness of exponents in our setting, we do not have to consider overflows ('NaN', 'infinity' or '-infinity', which are part of most floating-point specifications).
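As a concrete sketch of this representation, the following Python function rounds an exact rational to a fixed number of decimal mantissa digits while leaving the exponent unbounded. The choice of three digits and of round-to-nearest (with Python's half-to-even tie-breaking) is ours, purely for illustration:

```python
from fractions import Fraction

def fp_round(x: Fraction, digits: int = 3) -> Fraction:
    # Round x to the nearest m * 10**a with an integer mantissa m satisfying
    # 10**(digits-1) <= |m| < 10**digits; the exponent a is unbounded.
    if x == 0:
        return Fraction(0)
    y = abs(x)
    e = 0                       # exponent with 10**e <= y < 10**(e+1)
    while y >= 10:
        y /= 10
        e += 1
    while y < 1:
        y *= 10
        e -= 1
    scale = Fraction(10) ** (e - digits + 1)
    m = round(x / scale)        # signed integer mantissa
    if abs(m) == 10 ** digits:  # carry: e.g. 999.9 rounds up to 1000
        m //= 10
        scale *= 10
    return m * scale
```

For instance, with three mantissa digits, 12345 rounds to 12300 and 1/3 rounds to 333/1000; no overflow is possible, since the exponent is not bounded.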

Formally, we model our programs using linear dynamical systems (LDS), which comprise a starting vector representing the initial state of each variable and a matrix describing the evolution of the program. An LDS generates an infinite sequence of vectors (the orbit of the system) by multiplying the matrix with the current vector and then applying floating-point rounding to the result.

### Our results

We consider the model-checking problem for linear dynamical systems evolving under floating-point rounding. More formally, let $Y_1, \ldots, Y_k \subseteq \mathbb{R}^d$ be semialgebraic targets. Given an orbit $(x^{(t)})_{t \in \mathbb{N}}$, we define the characteristic word $w = w_1, w_2, w_3, \ldots$ with respect to $Y_1, \ldots, Y_k$ over the alphabet $2^{\{1,\ldots,k\}}$ such that $i \in w_t$ if and only if $x^{(t)} \in Y_i$. The model-checking problem asks whether w is in an ω-regular language, or equivalently satisfies a temporal specification given in monadic second-order logic (MSO).
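A finite prefix of the characteristic word can be computed directly from the orbit; the following Python sketch does so, with conventions of our own choosing (targets given as membership predicates, `step` one rounded matrix application, and the word indexed from the initial vector). The decidability question, of course, concerns the full infinite word:

```python
def characteristic_word(step, x0, targets, horizon):
    # Prefix of the word w over the alphabet 2^{1..k}: i ∈ w_t iff x^(t) ∈ Y_i.
    # Sketch only: the paper's indexing and rounding conventions apply in the
    # real setting; here `step` abstracts one (rounded) matrix application.
    w, x = [], x0
    for _ in range(horizon):
        w.append(frozenset(i + 1 for i, Y in enumerate(targets) if Y(x)))
        x = step(x)
    return w

# A one-dimensional system x -> 2x against the single target Y_1 = {x : x > 4}:
characteristic_word(lambda x: 2 * x, 1, [lambda x: x > 4], 5)
# [frozenset(), frozenset(), frozenset(), frozenset({1}), frozenset({1})]
```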

<sup>6</sup> We work in base 10 throughout for simplicity of exposition. All our results carry over mutatis mutandis in any integer base, including base 2 as typically used in practice.

Our results show that analysing LDS under floating-point rounding is neither clearly easier nor harder than in the standard setting (without rounding). Our first contribution establishes undecidability of point-to-point reachability (and a fortiori of model checking) under floating-point rounding, a surprising outcome given that point-to-point reachability is solvable in polynomial time without rounding [16]. On the other hand, in the standard setting neither decidability nor undecidability is known for full model checking (although mathematical hardness results exist); see [24,18,17].

Theorem 1. The floating-point point-to-point reachability problem is undecidable.

However, for non-negative matrices, we show that the full MSO model-checking problem is decidable in our setting, without restrictions on the dimensions of the predicates or the ambient space. This is in stark contrast to the standard setting, where assuming non-negativity does not simplify the problem: model checking non-negative LDS without rounding would require (at a minimum) solving the longstanding open Skolem and Positivity problems [2].

Theorem 2. Let $(M, x)$ be a non-negative linear dynamical system, let $Y\_1, \dots, Y\_k$ be semialgebraic targets and let $\varphi$ be an MSO formula using predicates over $Y\_1, \dots, Y\_k$. It is decidable whether the characteristic word under floating-point rounding satisfies $\varphi$.

We place no dimension restriction on the predicates; in particular, this shows that the Skolem and Positivity problems are decidable for non-negative systems under floating-point rounding. At this time, however, we have neither complexity upper bounds on our model-checking algorithm nor lower bounds on the model-checking problem.

#### Related work

There is a line of practical tools for the analysis, verification, and invariant synthesis of floating-point loops [7,20,1,22]. These tools typically work well in practice, but do not necessarily handle all cases. The analysis of concrete implementations of floating-point specifications requires careful treatment of edge cases around ±∞ and 'NaN'. In contrast to these tools, which focus primarily on practical analysis, our work seeks to understand the theoretical possibilities and limitations of the exact analysis of (possibly long-running) floating-point loops in a generalised setting.

The study of linear dynamical systems explores the sequence of vectors induced by iterating a matrix. Model checking is only known to be decidable for certain classes of semialgebraic predicates—in particular those of low dimension [18] or for prefix-independent properties [4]; see also [17]. The well-known Skolem and Positivity problems being special cases of model checking, they place technical limits on the dimensions that can be handled without first resolving longstanding open cases of these problems. Recent progress suggests that the Skolem problem may yet be conquered, at least for diagonalisable matrices [8,21], but Positivity requires solving particularly difficult problems in analytic number theory [24,12]. The non-negative case can be used to model sequences of distributions induced by Markov chains [6], although all hardness limitations apply already in the probabilistic setting [2].

Baier et al. [5] consider LDS under rounding to fixed-decimal precision, showing that reachability is PSPACE-complete for hyperbolic systems (those where no eigenvalue has modulus one) and decidable for certain other constrained classes of rounding. A notable difference is that fixed-decimal precision cannot represent arbitrarily small numbers, unlike the floating-point numbers we consider.

A recent line of work focuses on linear dynamical systems with perturbations at every step, with a view to understanding the robustness of reachability problems [13,14,3]. However, unlike rounding, the perturbation is chosen so as to assist in hitting the target, and it can be arbitrarily small.

For linear while loops, the reachability problem can be rephrased as a halting problem, asking whether a guard condition is eventually met from a given initial state. The related termination problem asks whether a guard condition is met from every initial state [26,10]. Issues arising from implementations that use floating-point representations to solve the termination problem of unrounded (arbitrary-precision) loops are considered in [27]. In contrast, we are interested in analysing programs in which the intended behaviour is to round the numbers to fixed-precision floating-point numbers at every step of the loop.

Organisation. In Section 2, we formalise the model and problems and discuss some properties of floating-point rounding. In Section 3, we present our undecidability result for the general case. Finally, in Section 4 we establish some special periodic structure associated with the orbit, and use this structure in Section 5 to show that model checking is decidable for non-negative LDS.

## 2 Preliminaries

#### 2.1 Linear dynamical systems and rounding functions

Definition 1. A d-dimensional linear dynamical system (LDS) $(M, x)$ comprises a matrix $M \in \mathbb{Q}^{d \times d}$ and an initial vector $x \in \mathbb{Q}^d$.

Given a rounding function $[\cdot] : \mathbb{Q}^d \to \mathbb{Q}^d$ and an LDS $(M, x)$, the rounded orbit $\mathcal{O}$ is the sequence $(x^{(t)})\_{t \in \mathbb{N}}$ such that $x^{(0)} = [x]$ and $x^{(t)} = [M x^{(t-1)}]$ for all $t \ge 1$.

Given $p \in \mathbb{N}$, we say that a number $x$ is a floating-point number with precision $p$ if $x = m \cdot 10^{\alpha}$, where $m \in \mathbb{Q}$ is a decimal number in $\{0\} \cup [0.1, 1)$ with $p$ digits in the fractional part (after the decimal point) and $\alpha \in \mathbb{Z}$. By convention, we associate the number with mantissa $m = 0$ with the exponent $-\infty$. Given a number $x = m \cdot 10^{\alpha}$ we define $\mathrm{mantissa}(x) = m$ and $\mathrm{exponent}(x) = \alpha$.

We are interested in the floating-point rounding function $[\cdot]$ with precision $p \in \mathbb{N}$. Given a real number $x \in \mathbb{R}$, we define $[x]$, the floating-point rounding of $x$, as the closest floating-point number with precision $p$, determined from the first $p + 1$ digits of $x$.

Where there are two possible choices, any deterministic choice that is consistent with the properties listed below is acceptable.<sup>7</sup> We denote by $\mathrm{FP}\_{10}[p]$ the subset of $\mathbb{Q}$ representable in base 10 as a floating-point number with $p$ digits. We use the following useful properties of the rounding function:

– mantissa-based: rounding commutes with scaling by powers of ten, i.e. $[10^a x] = 10^a [x]$ for all $a \in \mathbb{Z}$;
– log bounded: there is a constant $c \ge 1$ such that $\frac{1}{c}\, x \le [x] \le c\, x$ for all $x \ge 0$.
The floating-point rounding is defined above on a single real. It is extended to a vector $x$ by applying it to each of its components $(x)\_i$, where $i$ ranges from 1 to the dimension of the vector. As such, the term $[Mx]$ is obtained by first computing the vector $Mx$ exactly and then rounding each component $(Mx)\_i$. An alternative approach would be to maintain each subcomputation in $p$ digits of precision, but this is not the approach we take. Such an orbit can be simulated in our setting by increasing the dimension so that operations are staggered in such a way that at most one operation (scalar product or variable addition) is used in each assignment.
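To make these definitions concrete, the following sketch computes a rounded orbit over exact rationals. The paper only fixes the properties of $[\cdot]$, not an implementation; the helper `round_fp` below is one admissible choice, breaking ties by rounding up.

```python
from fractions import Fraction

def round_fp(x, p):
    """Round a rational x to the nearest base-10 floating-point number with a
    p-digit mantissa in [0.1, 1); ties are broken upwards (any fixed rule works)."""
    if x == 0:
        return Fraction(0)
    sign = 1 if x > 0 else -1
    y, a = abs(x), 0
    while y >= 1:                  # normalise so that y lies in [0.1, 1)
        y /= 10
        a += 1
    while y < Fraction(1, 10):
        y *= 10
        a -= 1
    scaled = y * 10 ** p           # mantissa scaled to [10^(p-1), 10^p)
    nearest = (2 * scaled.numerator + scaled.denominator) // (2 * scaled.denominator)
    return sign * nearest * Fraction(10) ** (a - p)

def rounded_orbit(M, x0, p, steps):
    """x^(0) = [x0] and x^(t) = [M x^(t-1)]: the matrix-vector product is
    computed exactly, then every component is rounded, as in the model above."""
    x = [round_fp(Fraction(v), p) for v in x0]
    orbit = [list(x)]
    for _ in range(steps):
        exact = [sum(Fraction(M[i][j]) * x[j] for j in range(len(x)))
                 for i in range(len(M))]
        x = [round_fp(v, p) for v in exact]
        orbit.append(list(x))
    return orbit
```

For instance, with precision $p = 1$ the orbit of the one-dimensional system $M = (3)$, $x = (1)$ begins $1, 3, 9, 30, \dots$, since $[27] = 30$.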

#### 2.2 Model checking

We consider the model-checking problem of an LDS over semialgebraic sets.

Definition 2. A semialgebraic set $Y \subseteq \mathbb{R}^d$ is defined by a finite Boolean combination of polynomial inequalities.

Let $(M, x)$ be an LDS with rounded orbit $\mathcal{O}$ and $\mathcal{Y} = \{Y\_1, \dots, Y\_k\}$ be a collection of semialgebraic sets. The characteristic word of $\mathcal{O}$ is $w = w\_1 w\_2 w\_3 \dots \in (2^{\{1,\dots,k\}})^{\omega}$ such that $j \in w\_t$ if and only if $x^{(t)} \in Y\_j$.

The model-checking problem asks whether the characteristic word is contained in a given ω-regular language, usually specified in a temporal logic such as monadic second-order logic (MSO), or often its LTL fragment. Without loss of generality we assume that the property is given as a Büchi automaton [11].

Problem 1 (Floating-point Model-checking Problem). Given an LDS $(M, x)$ with rounded orbit $\mathcal{O}$, a collection of semialgebraic sets $\mathcal{Y} = \{Y\_1, \dots, Y\_k\}$ and an ω-regular specification $\varphi$, the model-checking problem consists in deciding whether the characteristic word $w$ of $\mathcal{O}$ satisfies the specification $\varphi$.
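As a sketch of the objects involved, the characteristic word of a finite orbit prefix can be computed directly from the definition; here targets are given as arbitrary membership predicates standing in for the semialgebraic sets $Y\_1, \dots, Y\_k$.

```python
def characteristic_word(orbit, targets):
    """Letter w_t collects every (1-based) index j with x^(t) in Y_j;
    targets is a list of membership tests standing in for Y_1..Y_k."""
    return [frozenset(j + 1 for j, in_Y in enumerate(targets) if in_Y(x))
            for x in orbit]
```

Model checking then amounts to deciding membership of the (infinite) characteristic word in the language of the Büchi automaton for $\varphi$.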

<sup>7</sup> For example, always rounding up, always rounding down, rounding to even, rounding towards zero, and rounding away from zero are all acceptable, provided the choice is fixed.

We will also consider the point-to-point reachability problem, which is a subcase of the model-checking problem (Problem 1):

Problem 2 (Floating-point Point-to-point Reachability Problem). Given a d-dimensional LDS $(M, x)$ and a target vector $y \in \mathbb{Q}^d$, the point-to-point reachability problem consists in deciding whether $y$ belongs to the rounded orbit $\mathcal{O}$.

Given a target $Y \subseteq \mathbb{R}^d$, we associate with it the set of hitting times $Z(Y) = \{t \mid x^{(t)} \in Y\}$. Under this formulation, the reachability problem asks whether $Z(\{y\})$ is non-empty. For model checking, however, we will develop a more comprehensive understanding of the hitting times of each target $Y\_1, \dots, Y\_k$.
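On a finite prefix of the orbit, the hitting-time set restricts to a simple comprehension; this only illustrates the notation, since the problems above concern the infinite orbit.

```python
def hitting_times(orbit_prefix, Y):
    """Z(Y) = {t : x^(t) in Y}, restricted to a finite orbit prefix;
    Y is a membership predicate on vectors."""
    return {t for t, x in enumerate(orbit_prefix) if Y(x)}
```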

## 2.3 Structure of M

Formally, $M$ is a $d$-dimensional matrix indexed by the elements $\{1, \dots, d\}$. However, we interpret $M$ as an automaton over states $Q = \{q\_1, \dots, q\_d\}$ and reference the entries of $M$ by pairs of states. That is, we refer to $M\_{q\_1,q\_2}$ rather than $M\_{1,2}$.

We denote by $G\_M$ the weighted directed graph whose adjacency matrix is $M$. That is, a graph with vertices $Q$ and with an edge from $q\_j$ to $q\_i$ weighted by $M\_{q\_i,q\_j}$ whenever $M\_{q\_i,q\_j} \neq 0$.<sup>8</sup>

Let $S\_1, \dots, S\_s \subseteq Q$ be the strongly connected components (SCCs) of $G\_M$. Our analysis will consider each strongly connected component separately, so it will often be useful to consider the entries of $x \in \mathrm{FP}\_{10}[p]^Q$ corresponding to a single strongly connected component. Without loss of generality, by reordering the states where necessary, we assume that states within the same SCC appear next to one another and that the strongly connected components are topologically sorted, i.e. there is no edge from $S\_i$ to $S\_j$ where $i > j$. We split a vector $x$ into $s$ smaller vectors, denoted $x\_{S\_1}, \dots, x\_{S\_s}$, each representing the entries of $x$ corresponding to one SCC. Letting $x\_{S\_j} = (z\_{1,j}, \dots, z\_{d\_j,j})^T$ where $|S\_j| = d\_j$, the vector $x$ is thus partitioned as

$$x = (z\_{1,1} \cdots z\_{d\_1,1}, \cdots, z\_{1,s} \cdots z\_{d\_s,s})^T.$$

Moreover, for each pair of SCCs $S\_i, S\_j$, we denote by $M\_{S\_i,S\_j}$ the submatrix of $M$ restricted to the rows related to $S\_i$ and the columns related to $S\_j$, which is a matrix with $d\_i$ rows and $d\_j$ columns. If $S\_i = S\_j$, we simply write $M\_{S\_i}$. In other words, $M\_{S\_i,S\_j}$ is the matrix that captures the dependency of $S\_i$ on $S\_j$, and we have

$$M = \begin{pmatrix} M\_{S\_1} & M\_{S\_1, S\_2} & \cdots & M\_{S\_1, S\_s} \\ M\_{S\_2, S\_1} & M\_{S\_2} & \cdots & M\_{S\_2, S\_s} \\ \vdots & \vdots & \ddots & \vdots \\ M\_{S\_s, S\_1} & M\_{S\_s, S\_2} & \cdots & M\_{S\_s} \end{pmatrix}$$

We say $S\_i$ feeds $S\_j$, and $S\_j$ is fed by $S\_i$, if there is some edge in $G\_M$ from some state in $S\_i$ to some state in $S\_j$.

<sup>8</sup> Note that the orientation of the edge may appear switched from the reader's expectation. This is due to the convention that M is pre-multiplied with x at every step.

## 3 Undecidability of point-to-point reachability

In this section, we give a sketch of the proof of the undecidability of Problem 2 (and thus of Problem 1) in the general case. The full proof can be found in the long version of this paper [19].

Theorem 1. The floating-point point-to-point reachability problem is undecidable.

This result is obtained by reduction from the halting problem for two-counter Minsky machines. We recall the definition of this model:

Definition 3. A two-counter Minsky machine is defined by a finite set of states $\ell\_1, \dots, \ell\_m$, a distinguished starting state (w.l.o.g. $\ell\_1$), a distinguished halting state (w.l.o.g. $\ell\_m$), two counters over the natural numbers, here denoted $x$ and $y$, and a mapping that deterministically associates a transition with each state. Each transition takes one of the following forms, for $z \in \{x, y\}$:

– increment $\mathrm{inc}\_z(\ell\_j)$: add 1 to counter $z$, move to state $\ell\_j$.
– decrement $\mathrm{dec}\_z(\ell\_j)$: remove 1 from counter $z$ if $z > 0$, move to state $\ell\_j$.
– zero test $\mathrm{zero?}\_z(\ell\_j, \ell\_k)$: if $z = 0$ move to state $\ell\_j$, else move to state $\ell\_k$.

The configuration of a two-counter Minsky machine consists of the current state and the values of $x$ and $y$.

Without loss of generality (by first performing a zero test), one can assume that a decrement operation is never used in a configuration where the counter to be decreased has value 0, thereby removing the need to check whether $z > 0$.

The halting problem asks whether, starting in configuration $(\ell\_1, 0, 0)$, that is, in the distinguished starting state with both counters set to 0, the state $\ell\_m$ is eventually reached. This problem is undecidable [23].
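For experimentation, a two-counter Minsky machine is easy to step directly. The program encoding below (a map from state to transition tuple) is our own illustrative choice, and the fuel bound is unavoidable since halting is exactly the undecidable question.

```python
def run_minsky(program, start, halt, fuel=10_000):
    """Run a two-counter Minsky machine from configuration (start, 0, 0).
    Transitions: ('inc', z, next), ('dec', z, next) or
    ('zero', z, next_if_zero, next_if_positive) with z in {'x', 'y'}.
    Returns True iff the halting state is reached within `fuel` steps."""
    state, counters = start, {'x': 0, 'y': 0}
    for _ in range(fuel):
        if state == halt:
            return True
        op = program[state]
        if op[0] == 'inc':
            counters[op[1]] += 1
            state = op[2]
        elif op[0] == 'dec':
            if counters[op[1]] > 0:
                counters[op[1]] -= 1
            state = op[2]
        else:  # zero test
            state = op[2] if counters[op[1]] == 0 else op[3]
    return False
```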

We build an LDS with mantissa length $p = 1$ and base 10 that simulates a run of a given Minsky machine. The reduction happens to maintain the invariant that each mantissa always has the value 0 or 1 after rounding (although, as we operate in base 10, there are 10 possible values the mantissa could have taken). For ease of readability, we describe this LDS using variables to represent the dimensions and linear functions to represent the transition matrix. For each state of the Minsky machine, we use two variables corresponding to the two counters. Throughout the simulation, if the Minsky machine is in state $j$, the counter values are stored in the exponents of the variables associated with state $j$, and all other variables are zero.

The crux of our reduction lies in the handling of the zero test. More precisely, suppose we need to branch depending on whether $x$ is equal to 0; then we need to define linear transitions that transfer the values of the two counters from one pair of variables to the appropriate new pair of variables. This is done using filter functions: the function $\mathrm{filter}\_+(u, v)$ (resp. $\mathrm{filter}\_-(u, v)$) is equal to $v$ if $v \ge u$ (resp. $v < u$) and to 0 otherwise. We end this sketch with the construction of these functions and a proof that they operate as advertised.

Lemma 1. Given $u, v$ of the form $10^c$ with $c \in \mathbb{N}$, one can compute the value $w = \mathrm{filter}\_+(u, v)$ in three linear operations with floating-point rounding.

Proof. We compute $w = \mathrm{filter}\_+(u, v)$ in three successive operations using two temporary variables, $\mathit{temp}$ and $\mathit{temp2}$, initially set to 0 (recall that rounding is applied after each step):

$$\begin{aligned} \mathit{temp} &\leftarrow u + v \\ \mathit{temp2} &\leftarrow \mathit{temp} - u \\ w &\leftarrow 1.1 \cdot \mathit{temp2} \end{aligned}$$

Let $c\_1, c\_2 \in \mathbb{N}$ be such that $u = 10^{c\_1}$ and $v = 10^{c\_2}$. Recall that $[\cdot]$ denotes the floating-point rounding function. First observe that if $c\_1 = c\_2$:

$$\begin{aligned} \mathit{temp} &\leftarrow [10^{c\_1} + 10^{c\_2}] = 2 \cdot 10^{c\_1} \\ \mathit{temp2} &\leftarrow [2 \cdot 10^{c\_1} - 10^{c\_1}] = 10^{c\_1} (= v) \\ w &\leftarrow [1.1 \cdot 10^{c\_1}] = 10^{c\_1} = v \end{aligned}$$

as required. Secondly, assume that $u > v$, and thus $c\_1 > c\_2$:

$$\begin{aligned} \mathit{temp} &\leftarrow [10^{c\_1} + 10^{c\_2}] = 10^{c\_1} = u \\ \mathit{temp2} &\leftarrow [10^{c\_1} - 10^{c\_1}] = 0 \\ w &\leftarrow [1.1 \cdot 0] = 0 \end{aligned}$$

as required. We split the case $v > u$, thus $c\_2 > c\_1$, into two subcases. Suppose first that $c\_2 > c\_1 + 1$:

$$\begin{aligned} \mathit{temp} &\leftarrow [10^{c\_1} + 10^{c\_2}] = 10^{c\_2} = v \\ \mathit{temp2} &\leftarrow [10^{c\_2} - 10^{c\_1}] = [\underbrace{0.99\ldots99}\_{c\_2-c\_1 \ge 2} \cdot 10^{c\_2}] = 1 \cdot 10^{c\_2} = v \\ w &\leftarrow [1.1 \cdot 10^{c\_2}] = 10^{c\_2} = v \end{aligned}$$

as required. Finally, suppose $c\_2 = c\_1 + 1$:

$$\begin{aligned} \mathit{temp} &\leftarrow [10^{c\_1} + 10^{c\_2}] = 10^{c\_2} = v \\ \mathit{temp2} &\leftarrow [10^{c\_2} - 10^{c\_1}] = [0.9 \cdot 10^{c\_2}] = 9 \cdot 10^{c\_2-1} \\ w &\leftarrow [1.1 \cdot 9 \cdot 10^{c\_2-1}] = [9.9 \cdot 10^{c\_2-1}] = 10 \cdot 10^{c\_2-1} = 10^{c\_2} = v \end{aligned}$$

as required. ⊓⊔

Corollary 1. Given $u, v$ of the form $10^c$ with $c \in \mathbb{N}$, one can compute the value $w = \mathrm{filter}\_-(u, v)$ in four linear operations with floating-point rounding.

Proof. Observe that $\mathrm{filter}\_-(u, v) = v - \mathrm{filter}\_+(u, v)$, which can be encoded in four steps by first computing $\mathrm{filter}\_+(u, v)$ in three steps. ⊓⊔
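The case analysis of Lemma 1 can be replayed mechanically. In the sketch below, `rnd1` is an illustrative stand-in for $[\cdot]$ with precision $p = 1$ that rounds ties upward (one of the deterministic choices the proof permits).

```python
from fractions import Fraction

def rnd1(x):
    """Base-10 rounding to a 1-digit mantissa (precision p = 1); ties round up."""
    x = Fraction(x)
    if x == 0:
        return Fraction(0)
    y, a = abs(x), 0
    while y >= 1:
        y /= 10
        a += 1
    while y < Fraction(1, 10):
        y *= 10
        a -= 1
    scaled = y * 10                     # scaled mantissa in [1, 10)
    m = (2 * scaled.numerator + scaled.denominator) // (2 * scaled.denominator)
    return (1 if x > 0 else -1) * m * Fraction(10) ** (a - 1)

def filter_plus(u, v):
    """filter+(u, v) for u, v powers of 10: three rounded linear operations."""
    temp = rnd1(u + v)
    temp2 = rnd1(temp - u)
    return rnd1(Fraction(11, 10) * temp2)

def filter_minus(u, v):
    """filter-(u, v) = v - filter+(u, v): one extra rounded subtraction."""
    return rnd1(v - filter_plus(u, v))
```

Each of the four cases of the proof (and the corollary) can then be checked by direct evaluation, e.g. `filter_plus(10, 100) == 100` and `filter_plus(1000, 10) == 0`.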

## 4 Pseudo-periodic orbits of non-negative LDS

We shift our focus to proving that model checking is decidable for systems with non-negative matrices. We first establish the behaviour of the system in this section and then complete the proof of Theorem 2 in Section 5. Our main result here is that the rounded orbit of a non-negative LDS is periodic in the following sense, which we call pseudo-periodic.

Definition 4. A sequence $(x^{(t)})\_{t \in \mathbb{N}}$ of d-dimensional vectors of floating-point numbers is called pseudo-periodic if and only if there exist a starting point $N \in \mathbb{N}$, a period $T \in \mathbb{N}$ and growth rates $\alpha\_1, \dots, \alpha\_d \in \mathbb{Z}$ such that

$$\forall t \ge N, \forall j \in \{1, \dots, d\}, (x^{(t+T)})\_j = 10^{\alpha\_j} (x^{(t)})\_j.$$

We say the sequence is effectively pseudo-periodic if the defining constants $N, T, \alpha\_1, \dots, \alpha\_d$ can be computed.
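Definition 4 suggests an immediate finite check: given candidate constants, verify the relation on an orbit prefix. A finite prefix can of course only provide evidence, since the definition quantifies over all $t \ge N$; exact rationals stand in for floating-point values here.

```python
from fractions import Fraction

def check_pseudo_periodic(prefix, N, T, alphas):
    """Test x^(t+T)_j == 10^{alpha_j} * x^(t)_j for every t with
    N <= t and t + T < len(prefix), and every coordinate j."""
    return all(prefix[t + T][j] == Fraction(10) ** a * prefix[t][j]
               for t in range(N, len(prefix) - T)
               for j, a in enumerate(alphas))
```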

Theorem 3. Let $(M, x)$ be a d-dimensional LDS where $M$ is non-negative and let $(x^{(t)})\_{t \in \mathbb{N}}$ be its rounded orbit. Then $(x^{(t)})\_{t \in \mathbb{N}}$ is effectively pseudo-periodic.

In order to establish this result, we will find partitions of the graph associated with $M$ such that each part is effectively pseudo-periodic, with the same growth rate $\alpha$ for every state in the part.

#### 4.1 Preprocessing periodicity

The core of our approach is to show that, within each SCC of the graph associated to M, the values associated with states are of similar magnitude. This is however only true if the SCC is aperiodic. When a state is in a periodic SCC its value could change drastically depending on which phase the system is in. For example, consider a simple alternation between two states, in which the value is very large in one state and very small in the other; the states will alternate between big and small values.

We "hide" these periodic behaviours by blowing up the system so that each SCC of the new system describes only one of the periodic subsequences, and we will subsequently show that the value of each state in an SCC is either zero or of a similar magnitude.

We apply the following construction to our system. Let $P$ be the period, defined as the least common multiple of the lengths of all simple cycles in the graph. Let $Q$ be the indices of $M$ (i.e. the states of the generated automaton). We define new states $Q' = Q \times \{0, \dots, P-1\}$ by annotating each state in $Q$ with the phase. To avoid cluttering notation, we will regularly refer to states in $Q'$ in the form $(q, i + \ell)$ for $\ell \in \mathbb{Z}$, on the understanding that the phase $i + \ell$ is normalised into $\{0, \dots, P-1\}$ by taking the residue modulo $P$ if necessary. We define a new matrix $M'$ over the states $Q'$ such that $M'\_{(q,i+1),(q',i)} = M\_{q,q'}$ for $i \in \{0, \dots, P-1\}$, and zero otherwise. We initialise a new starting vector by $x'^{(0)}\_{(q,0)} = x^{(0)}\_q$ and $x'^{(0)}\_{(q,i)} = 0$ for $i \in \{1, \dots, P-1\}$.

Intuitively, at each time step t the vector generated by the original system is equal to the vector of the new system restricted to the states indexed by i ≡ t mod P and every state with another index is equal to 0.
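The blow-up construction can be written out directly. Flattening a state-phase pair $(q, i)$ to the single index $qP + i$ is an implementation detail of this sketch.

```python
def blow_up(M, x0, P):
    """Phase blow-up of a d-state system: states become pairs (q, i) for
    phases i in 0..P-1, with M'_{(q,i+1),(q',i)} = M_{q,q'} and
    x'_{(q,0)} = x_q (all other starting entries zero)."""
    d = len(M)
    idx = lambda q, i: q * P + i        # flatten (q, phase) to one index
    Mp = [[0] * (d * P) for _ in range(d * P)]
    for q in range(d):
        for qp in range(d):
            for i in range(P):
                Mp[idx(q, (i + 1) % P)][idx(qp, i)] = M[q][qp]
    xp = [0] * (d * P)
    for q in range(d):
        xp[idx(q, 0)] = x0[q]
    return Mp, xp
```

For instance, the one-state system with $M = (2)$ and $P = 2$ yields the 2×2 matrix `[[0, 2], [2, 0]]`, whose two copies of the state alternate phases as intended.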

Let $S \subseteq Q$ be a strongly connected component. In $Q'$ there exist strongly connected components $S'\_1, \dots, S'\_k \subseteq Q'$ with $k \le |S|$ such that $\bigcup\_{i=1}^{k} S'\_i = S \times \{0, \dots, P-1\}$. Each set $S'\_j$ is periodic, with period $P$.

In the rest of this section we work with the system $(M', x')$, implicitly over states $Q'$, which, by overloading of notation, we rename $(M, x)$ over $Q$ to avoid cluttering notation.

Note that this transformation also requires us to marginally complicate the targets. Indeed, consider a set $Y \subseteq \mathbb{R}^Q$. We define the sets $Y/i$ for $i < P$ by $Y/i = \{y \in \mathbb{R}^{Q'} \mid \exists y' \in Y : y\_{(q,i)} = y'\_q \text{ for } q \in Q \text{ and } y\_{(q,j)} = 0 \text{ for } j \neq i\}$. The hitting times $Z(Y)$ in the original LDS can then be obtained in the new LDS as the disjoint union $\bigcup\_{i \in \{0,\dots,P-1\}} Z(Y/i)$. It suffices to characterise the hitting times for each $Y/i$.

## 4.2 Pseudo-periodicity within top SCCs

Let us first consider top SCCs, i.e. SCCs with no incoming edges from states of other SCCs; the value of each variable at each step therefore depends only on the values of states in the same SCC.

Lemma 2. Let $S\_j$ be a strongly connected component of $(M, x)$. Let $S\_{j,i} = \{(q, i) \in S\_j\}$ be the states of $S\_j$ in the $i$-th phase. There exists $C \le P d^2$ such that, for every $i, j$, $(M^C)\_{S\_{j,i}}$ is positive.

Proof. The matrix $(M^P)\_{S\_{j,i}}$ is non-negative, irreducible (i.e., its graph is strongly connected) and of period 1. As such, $(M^P)\_{S\_{j,i}}$ is primitive [9], which means that some power $C'$ of this matrix is positive. The lemma follows with $C = P C'$. Moreover, $C'$ is at most $d^2 - 2d + 2$ [25]. ⊓⊔

Our goal is to show that, within an SCC, the non-zero entries are all of similar magnitude, owing to the presence of a relatively short path (of length at most $C$) between any two states in the SCC. To do this we introduce a notion of closeness and observe some useful properties.

Definition 5. We say two numbers $x, x' \in \mathrm{FP}\_{10}[p]$ are δ-close, denoted $x \approx\_\delta x'$, if $|\mathrm{exponent}(x) - \mathrm{exponent}(x')| < \delta$. In particular, for every $\delta > 0$, zero is δ-close only to itself.

We extend the notion to vectors $y, y' \in \mathrm{FP}\_{10}[p]^S$ indexed by $S \subseteq Q$: we write $y \approx\_\delta y'$ if all entries of the same phase are δ-close to one another across both $y$ and $y'$, that is, for each phase $i \in \{0, \dots, P-1\}$ and all $(q,i), (q',i) \in S$: $y\_{(q,i)} \approx\_\delta y'\_{(q',i)}$, $y\_{(q,i)} \approx\_\delta y\_{(q',i)}$ and $y'\_{(q,i)} \approx\_\delta y'\_{(q',i)}$.

Proposition 1. Let $x, x' \in \mathrm{FP}\_{10}[p]$ be non-zero floating-point numbers.

(1) If $x \approx\_\delta x'$ then $10^{-\delta-1} \le x/x' \le 10^{\delta+1}$.
(2) If $10^{-\delta} \le x/x' \le 10^{\delta}$ then $x \approx\_{\delta+2} x'$.
(3) If $x \approx\_\delta x'$ and $x' \approx\_\eta x''$ then $x \approx\_{\delta+\eta+4} x''$.

Lemma 3. Let $S\_j$ be a top strongly connected component of $(M, x)$, and let $C$ be as given by Lemma 2. There exists $\beta \in \mathbb{N}$ such that for all $(q,i), (q',i) \in S\_j$ and every $t \ge C$:

– if $t \not\equiv i \bmod P$, then $x^{(t)}\_{(q,i)} = 0$;
– otherwise, $x^{(t)}\_{(q,i)} \approx\_\beta x^{(t)}\_{(q',i)}$.

Proof. Let $t \in \mathbb{N}$. If $t \not\equiv i \bmod P$ then $x^{(t)}\_{(q,i)} = 0$ for all $(q,i) \in S\_{j,i}$ by construction.

Otherwise, let $m \ge \max\_{q,q' \in Q : M\_{q,q'} \neq 0} \max\left(M\_{q,q'}, (M\_{q,q'})^{-1}\right)$ be a constant larger than all values occurring in $M$ and such that $\frac{1}{m}$ is smaller than all non-zero values appearing in $M$. Let $c$ be the constant from the log bounded property of the rounding function $[\cdot]$ and $d$ the dimension of $M$.

Observe that for all $t \in \mathbb{N}$ with $t \equiv i \bmod P$ we have

$$\begin{aligned} x\_{(q,i)}^{(t)} &= \left[ \sum\_{(q',i-1)} M\_{(q,i),(q',i-1)} x\_{(q',i-1)}^{(t-1)} \right] \\ &\geq \frac{1}{c} \sum\_{(q',i-1)} M\_{(q,i),(q',i-1)} x\_{(q',i-1)}^{(t-1)} &\qquad \text{(by log bounded)} \\ &\geq \frac{1}{cm} \max\_{(q',i-1) \text{ s.t. } M\_{(q,i),(q',i-1)} > 0} x\_{(q',i-1)}^{(t-1)} &\qquad \text{(by defn of } m) \end{aligned}$$

In particular

$$x\_{(q,i)}^{(t)} \ge \frac{1}{cm} x\_{(q',i-1)}^{(t-1)} \text{ for all } (q', i-1) \text{ s.t. } M\_{(q,i),(q',i-1)} > 0$$

Using induction we obtain:

$$x\_{(q,i+k)}^{(t+k)} \ge \frac{1}{(cm)^{k-1}} x\_{(q',i+1)}^{(t+1)} \ge \frac{1}{(cm)^k} x\_{(q'',i)}^{(t)}$$

for all $(q', i+1), (q'', i)$ such that $M^{k-1}\_{(q,i+k),(q',i+1)} > 0$ and $M\_{(q',i+1),(q'',i)} > 0$.

In particular, we have $x^{(t+C)}\_{(q,i)} \ge \frac{1}{(cm)^C}\, x^{(t)}\_{(q',i)}$ for all $q'$ (since $M^C\_{(q,i),(q',i)} > 0$ for all $q'$ by the previous lemma).

On the other hand we have

$$x\_{(q,i+1)}^{(t+1)} = \left[ \sum\_{q' : M\_{(q,i+1),(q',i)} > 0} M\_{(q,i+1),(q',i)}\, x\_{(q',i)}^{(t)} \right] \le mcd \max\_{(q',i) \in S\_j} x\_{(q',i)}^{(t)}.$$

By induction we get that $x^{(t+C)}\_{(q,i)} \le (mcd)^C \max\_{(q',i) \in S\_j} x^{(t)}\_{(q',i)}$. Hence, for all $q, q' \in S\_j$ we have

$$\frac{1}{(mc)^C} \max\_{(q'',i)\in S\_j} x\_{(q'',i)}^{(t)} \le x\_{(q,i)}^{(t+C)} \quad \text{and} \quad x\_{(q,i)}^{(t+C)} \le (mcd)^C \max\_{(q'',i)\in S\_j} x\_{(q'',i)}^{(t)}.$$

Hence $x^{(t+C)}\_{(q,i)} / x^{(t+C)}\_{(q',i)} \le d^C (mc)^{2C}$.

Setting $\gamma = \log\_{10}\left(d^C (mc)^{2C}\right)$, we thus have $10^{-\gamma} x^{(t+C)}\_{(q',i)} \le x^{(t+C)}\_{(q,i)} \le 10^{\gamma} x^{(t+C)}\_{(q',i)}$ for all $(q,i), (q',i) \in S\_{j,i}$ and $t \in \mathbb{N}$. Then $x^{(t)}\_{(q',i)}$ and $x^{(t)}\_{(q,i)}$ are $\beta$-close for $\beta = \gamma + 2$ by Proposition 1. ⊓⊔

Lemma 4. Let $S\_j$ be a top strongly connected component of $(M, x)$. Then the sequence $(x^{(t)}\_{S\_j})\_{t \in \mathbb{N}}$ is effectively pseudo-periodic.

Proof. Let $\beta$ and $C$ be as in Lemma 3. Denote by $q\_1, \dots, q\_m$ the states of $S\_j$. We define the sequence $(y^{(t)})\_{t \ge C}$ such that for all $t \ge C$ and $q \in S\_j$, writing $p^{(t)}\_q = \mathrm{mantissa}(x^{(t)}\_q)$ and $\alpha^{(t)}\_q = \mathrm{exponent}(x^{(t)}\_q)$, we have

$$y^{(t)} = \left(p^{(t)}\_{q\_1},\, 0,\, p^{(t)}\_{q\_2},\, \alpha^{(t)}\_{q\_2} - \alpha^{(t)}\_{q\_1},\, \dots,\, p^{(t)}\_{q\_m},\, \alpha^{(t)}\_{q\_m} - \alpha^{(t)}\_{q\_1}\right).$$

Note that this sequence can take only finitely many values, as the mantissas have a precision of $p$ decimals and, by Lemma 3, for all $k \le m$, $\alpha^{(t)}\_{q\_k} - \alpha^{(t)}\_{q\_1} \in \{-\beta, \dots, \beta\}$. As a consequence, the sequence $(y^{(t)})\_{t \ge C}$ takes the same value multiple times. Let $k\_1 < k\_2$ be the two minimal distinct integers such that $y^{(k\_1)} = y^{(k\_2)}$. Setting $\alpha = \alpha^{(k\_2)}\_{q\_1} - \alpha^{(k\_1)}\_{q\_1}$, we have $x^{(k\_2)} = x^{(k\_1)} \cdot 10^{\alpha}$. Since $[\cdot]$ is mantissa-based, one can show by induction that for all $t \ge 0$, $x^{(k\_2+t)} = x^{(k\_1+t)} \cdot 10^{\alpha}$. Therefore the sequence $(x^{(t)}\_{S\_j})\_{t \in \mathbb{N}}$ is effectively pseudo-periodic with period $T = k\_2 - k\_1$ and starting point $N = C + k\_1$.

Moreover, as the maximum number of different values taken by $(y^{(t)})\_{t \ge C}$ is known, we can deduce that both $k\_1$ and $k\_2 - k\_1$ are smaller than $10^{pm}(2\beta+1)^{m} + 1$. ⊓⊔
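The repetition argument in the proof of Lemma 4 is effective and easy to simulate: record at each step the signature consisting of the mantissas together with exponent offsets relative to the first state, and stop at the first recurrence. The signature encoding below is our own paraphrase of the sequence $y^{(t)}$.

```python
from fractions import Fraction

def find_repetition(orbit_fn, C, max_steps=10_000):
    """Return (k1, T): the first time k1 >= C whose signature recurs at
    k1 + T. orbit_fn(t) yields the rounded vector x^(t), assumed non-zero."""
    def exponent(v):
        a, y = 0, abs(Fraction(v))
        while y >= 1:
            y /= 10
            a += 1
        while y < Fraction(1, 10):
            y *= 10
            a -= 1
        return a
    def signature(x):
        e0 = exponent(x[0])
        return tuple((Fraction(v) / Fraction(10) ** exponent(v),  # mantissa
                      exponent(v) - e0)                            # offset
                     for v in x)
    seen = {}
    for t in range(C, C + max_steps):
        sig = signature(orbit_fn(t))
        if sig in seen:
            return seen[sig], t - seen[sig]
        seen[sig] = t
    raise RuntimeError("no repetition within max_steps")
```

On a pseudo-periodic orbit, the growth rate can then be read off as the exponent difference across one period.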

Note that the growth rate is the same for every state of the strongly connected component.

#### 4.3 Pseudo-periodicity within lower SCCs

We now consider a strongly connected component $S\_{\mathrm{me}}$ that is fed by at least one strongly connected component; let $F\_1, \dots, F\_\ell$, $\ell \ge 1$, be the components feeding it. We let $S\_F = F\_1 \cup \dots \cup F\_\ell$ and assume every $F\_i$ is pseudo-periodic.

In this section we show:

Theorem 4. $S\_{\mathrm{me}}$ is effectively pseudo-periodic, and the growth rate is the same for all $q \in S\_{\mathrm{me}}$.

We first observe that the difference between values in $S\_{\mathrm{me}}$ is bounded. This is achieved with a proof similar to those of Lemma 2 and Lemma 3 (though having to combine considerations of $S\_{\mathrm{me}}$ and $S\_F$).

Lemma 5. There exist $\eta, N' \in \mathbb{N}$ such that for all $(q,i), (q',i) \in S\_{\mathrm{me}}$, all $t \ge N'$ and all $i \in \{0, \dots, P-1\}$:

– if $t \not\equiv i \bmod P$, then $x^{(t)}\_{(q,i)} = 0$;
– otherwise, $x^{(t)}\_{(q,i)} \approx\_\eta x^{(t)}\_{(q',i)}$.

Definition 6. We say that $x^{(t)}\_q$ is influenced by $S\_F$ if

$$x\_q^{(t)} = \left[\sum\_{q' \in S\_F} M\_{q,q'} x\_{q'}^{(t-1)} + \sum\_{q' \in S\_{me}} M\_{q,q'} x\_{q'}^{(t-1)}\right] \neq \left[\sum\_{q' \in S\_{me}} M\_{q,q'} x\_{q'}^{(t-1)}\right]$$

and, in particular, that $x^{(t)}\_q$ is influenced by $u \in S\_F$ if:

$$\left[\sum\_{q' \in S\_F \cup S\_{me}} M\_{q,q'} x\_{q'}^{(t-1)}\right] \neq \left[\sum\_{q' \in (S\_F \cup S\_{me}) \setminus \{u\}} M\_{q,q'} x\_{q'}^{(t-1)}\right].$$

We can restrict $S\_F$ to those $F\_i$ with the maximum growth rate. Indeed, from some point on, any $F\_i$ with non-maximal growth rate is much smaller than the maximal ones, and, since by the proof of Lemma 5 the values within $S\_{\mathrm{me}}$ are close to (or greater than) the maximum value within $S\_F$, such an $F\_i$ cannot influence any $x^{(t)}\_q$ with $q \in S\_{\mathrm{me}}$. Let $N\_1$ be the point from which we can assume that the elements of $S\_F$ are much larger than those of any other feeding SCC and are thus the only ones potentially influencing $S\_{\mathrm{me}}$.

Since each $F\_i$ is assumed to be pseudo-periodic, $S\_F$ is pseudo-periodic. Let $T$ be the period of $S\_F$, $N\_2$ its starting point and $\alpha$ the growth rate of every state of $S\_F$ (meaning the exponent of every state changes by $\alpha$ every $T$ steps starting from the $N\_2$-th step). Let $N = \max\{N\_1, N\_2\}$, that is, the point from which we can assume $S\_F$ is both pseudo-periodic and dominates the non-maximal SCCs feeding $S\_{\mathrm{me}}$.

As a direct consequence of having the same growth rate, the non-zero terms within S<sup>F</sup> are close:

Proposition 2. If a sequence of non-zero floating-point vectors $(v^{(t)})\_{t \in \mathbb{N}}$ is pseudo-periodic with the same growth rate within a set $Q$, then there exists $\delta$ such that for all $q, q' \in Q$ and all $t \ge N$, $v^{(t)}\_q \approx\_\delta v^{(t)}\_{q'}$.

Moreover, either $S\_F$ does not influence $S\_{\mathrm{me}}$, or they are close.

Lemma 6. There exist $\beta, N \in \mathbb{N}$ such that for $t \ge N$ and $(q,i) \in S\_{\mathrm{me}}$, if $x^{(t)}\_{(q,i)}$ is influenced by $(q', i-1) \in S\_F$, then $x^{(t)}\_{(r,i)} \approx\_\beta x^{(t)}\_{(r',i)}$ for all $(r,i), (r',i) \in S\_{\mathrm{me}} \cup S\_F$.

We will show Theorem 4 through the following observation:

Observation 1. $S\_F$ influences $S\_{\mathrm{me}}$ either finitely many times or infinitely many times. We have two cases:

– If $S\_F$ influences $S\_{\mathrm{me}}$ only finitely many times, then after the last influence $S\_{\mathrm{me}}$ evolves autonomously, and is pseudo-periodic by the arguments used for top SCCs.
– If $S\_F$ influences $S\_{\mathrm{me}}$ infinitely many times, then by Lemma 6 the two components are infinitely often $\beta$-close, and $S\_{\mathrm{me}}$ is pseudo-periodic by Lemma 7 below.
It then remains to show that we can detect which of the two cases applies, and to place a bound on the time needed to detect this, which will effectively reveal the constants of the pseudo-periodic behaviour.

We now present a version of Lemma 4 to observe that if $S\_F$ and $S\_{\mathrm{me}}$ are infinitely often β-close then $S\_{\mathrm{me}}$ is pseudo-periodic:

Lemma 7. Suppose $x^{(t)}\_{S\_F} \approx\_\beta x^{(t)}\_{S\_{me}}$ for infinitely many $t$. Then there exist $t\_1 < t\_2$ such that $x^{(t\_1)}\_{S\_F} \approx\_\beta x^{(t\_1)}\_{S\_{me}}$, $x^{(t\_2)}\_{S\_F} \approx\_\beta x^{(t\_2)}\_{S\_{me}}$, $x^{(t\_2)}\_{S\_F} = 10^{\gamma} x^{(t\_1)}\_{S\_F}$ and $x^{(t\_2)}\_{S\_{me}} = 10^{\gamma} x^{(t\_1)}\_{S\_{me}}$. In particular, the sequence $(x^{(t)}\_{S\_{me}})\_{t \in \mathbb{N}}$ is pseudo-periodic with period $t\_2 - t\_1$, starting from $t\_1$, with growth rate $\gamma$ in every state.

Proof. At a time $t$ such that $x^{(t)}\_{S\_F} \approx\_\beta x^{(t)}\_{S\_{me}}$, we write the vectors $x^{(t)}\_{S\_F} \in \mathrm{FP}\_{10}[p]^{|S\_F|}$ and $x^{(t)}\_{S\_{me}} \in \mathrm{FP}\_{10}[p]^{|S\_{me}|}$ respectively as

$$\begin{aligned} &\left(m\_1^{(t)} 10^{\gamma^{(t)}},\; m\_2^{(t)} 10^{\gamma^{(t)}+\alpha\_2^{(t)}},\; \dots,\; m\_{|S\_F|}^{(t)} 10^{\gamma^{(t)}+\alpha\_{|S\_F|}^{(t)}}\right) \text{ and} \\ &\left(n\_1^{(t)} 10^{\gamma^{(t)}+\zeta\_1^{(t)}},\; \dots,\; n\_{|S\_{me}|}^{(t)} 10^{\gamma^{(t)}+\zeta\_{|S\_{me}|}^{(t)}}\right), \end{aligned}$$

where the $m\_i, n\_i$ are taken from the finite set of mantissa values expressible with $p$ digits, $\gamma^{(t)} \in \mathbb{Z}$, and $\alpha\_i, \zeta\_i \in \mathbb{Z} \cap [-\beta, \beta]$ denote the offsets from $\gamma^{(t)}$.

Let $F$ bound the number of possible values the $m_i, n_i, \alpha_i, \zeta_i$ can take, where $F \leq 10^{p(|S_F|+|S_{me}|)} \cdot (2\beta + 1)^{|S_F|+|S_{me}|-1}$. By the pigeonhole principle, after at most $F + 1$ times at which $x^{(t)}_{S_F} \approx_\beta x^{(t)}_{S_{me}}$, there must exist two times $t_1 < t_2$ at which the values of the $m_i, n_i, \alpha_i, \zeta_i$ are all equal (although the value of $\gamma$ could be different), thus $x^{(t_2)}_{S_F \cup S_{me}} = 10^{\gamma^{(t_2)} - \gamma^{(t_1)}} x^{(t_1)}_{S_F \cup S_{me}}$.

Since the rounding function is mantissa-based, the system evolution from $x^{(t_1)}$ is equivalent to the system evolution from $x^{(t_2)} = 10^\gamma x^{(t_1)}$, where $\gamma = \gamma^{(t_2)} - \gamma^{(t_1)}$ is the growth rate. ⊓⊔

We can in fact decide whether $x^{(t)}_{S_F} \approx_\beta x^{(t)}_{S_{me}}$ holds for the last time:

Lemma 8. Let $\beta, N$ be defined as in Lemma 6. If $t \geq N$ then it is decidable whether there exists $t' > t$ such that $x^{(t')}_{S_F} \approx_\beta x^{(t')}_{S_{me}}$.

Proof Sketch (Full proof available in [19]). If we considered $S_{me}$ in isolation, without the effect of $S_F$, we know it would be pseudo-periodic. We can simulate one period of $S_{me}$ with and without the effect of $S_F$ and determine whether $S_F$ influences $S_{me}$ within one period. If it does, then they must be close at this point. If $S_F$ does not influence $S_{me}$, we know that $S_{me}$ will behave pseudo-periodically at least until $S_F$ is close to $S_{me}$ again; having established a growth rate for $S_{me}$, we can compare the growth rates of $S_F$ and $S_{me}$ to see whether $S_{me}$ will ever be close to $S_F$ again in the future. ⊓⊔

Finally, to conclude the proof of Theorem 4, we refine Observation 1 to show that the period is bounded and thus the growth rates are computable:


Which of these occurs is determined by at most F + 1 applications of Lemma 8.

## 5 Decidability of model checking

In this section we use the results obtained in the previous section to show that model checking is decidable. We use pseudo-periodicity to show that the characteristic word is eventually periodic, a case for which model checking is decidable.

Theorem 2. Let $(M, x)$ be a non-negative linear dynamical system, let $Y_1, \dots, Y_k$ be semialgebraic targets and let $\varphi$ be an MSO formula using predicates over $Y_1, \dots, Y_k$. It is decidable whether the characteristic word under floating-point rounding satisfies $\varphi$.

Consider a semialgebraic target $Y$, which can be expressed as a Boolean combination of polynomial inequalities over variables representing the dimensions. That is, $Y = \{(x_1, \dots, x_d) \mid \bigwedge_i \bigvee_j P_{ij}(x_1, \dots, x_d) \mathbin{\triangleright_{ij}} 0\}$, where $\triangleright_{ij} \in \{\geq, >, =\}$.

Given a linear dynamical system $(M, x)$ defining the rounded orbit $(x^{(n)})_{n=1}^{\infty}$, recall that $Z(Y) = \{n \mid x^{(n)} \in Y\}$ is the set of hitting times of $Y$. We claim that this set is semi-linear (equivalently, eventually periodic) for semialgebraic $Y$.

Definition 7. A 1-dimensional linear set, defined by a base $b \in \mathbb{N}$ and period $p \in \mathbb{N}$, is the set $\{x \mid \exists k \in \mathbb{N} : x = b + k \cdot p\}$. A semi-linear set is the union of a finite set $F \subseteq \mathbb{N}$ and finitely many linear sets. It can be assumed that each linear set has the same period. Hence a 1-dimensional semi-linear set $X$ is defined by a finite set $F \subseteq \mathbb{N}$ and integers $m, p, b_1, \dots, b_m \in \mathbb{N}$ such that $x \in X$ if and only if $x \in F$ or $x = b + k \cdot p$ for some $k \in \mathbb{N}$ and $b \in \{b_1, \dots, b_m\}$.
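As a concrete illustration, membership in a 1-dimensional semi-linear set in the sense of Definition 7 can be sketched as follows (the function and argument names are ours, not the paper's):

```python
def in_semilinear(x, finite_part, bases, period):
    """Membership test for the semi-linear set given by a finite part F,
    bases b_1, ..., b_m, and a common period p (cf. Definition 7)."""
    if x in finite_part:
        return True
    # x lies in some linear set {b + k*p : k in N} iff x >= b and p divides x - b
    return any(x >= b and (x - b) % period == 0 for b in bases)
```

For instance, with `finite_part = {2}`, `bases = [3, 5]` and `period = 4`, the set is $\{2\} \cup \{3, 7, 11, \dots\} \cup \{5, 9, 13, \dots\}$.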

Theorem 5. Let $Y$ be a semialgebraic target; then $Z(Y)$ is a semi-linear set.

Theorem 5 essentially completes the proof of Theorem 2. It is almost immediate that the characteristic word is eventually periodic (see the long version [19] for a formal proof), and thus the model-checking problem can be decided by checking whether $A \cap B = \emptyset$, where $A$ is an automaton representing the characteristic word and $B$ encodes the language of $\varphi$.

It is standard that semi-linear sets are closed under intersection, union, and complementation (see [15] for a nice introduction to semi-linear sets). Thus in order to express the hitting times of $Z(Y)$ it is sufficient to express the hitting times of $\{(x_1, \dots, x_d) \mid P(x_1, \dots, x_d) \geq 0\}$ for finitely many polynomials $P$. Conjunction is handled by taking the intersection of the hitting times, and disjunction by taking the union. The hitting times of $P(x_1, \dots, x_d) > 0$ can be rewritten as the complement of the hitting times of $-P(x_1, \dots, x_d) \geq 0$. The hitting times of $P(x_1, \dots, x_d) = 0$ are the intersection of those of $P(x_1, \dots, x_d) \geq 0$ and $-P(x_1, \dots, x_d) \geq 0$. Thus Theorem 5 is a consequence of the following lemma.

Lemma 9. Assume $(x^{(t)})_{t=1}^{\infty} = (z^{(t)}_1, \dots, z^{(t)}_d)_{t=1}^{\infty}$ is a pseudo-periodic sequence with start point $N$, period $T$ and growth rates $\alpha_1, \dots, \alpha_d$, and let $P \in \mathbb{Q}[x_1, \dots, x_d]$ be a rational polynomial in $d$ variables.<sup>9</sup> Then $\{t \in \mathbb{N} \mid P(z^{(t)}_1, \dots, z^{(t)}_d) \geq 0\}$ is a semi-linear set.

<sup>9</sup> Some variables may be redundant; that is, if the polynomial does not depend on all dimensions of $x^{(t)}$, then some of the variables may not appear in $P$.

Proof. First, we show that pseudo-periodicity is closed under products. Suppose $x^{(N+Tn)}_i = m_i 10^{\beta_i + \alpha_i \cdot n}$ and $x^{(N+Tn)}_j = m_j 10^{\beta_j + \alpha_j \cdot n}$. Observe that $x^{(N+Tn)}_i \cdot x^{(N+Tn)}_j = m_i 10^{\beta_i + \alpha_i n} \cdot m_j 10^{\beta_j + \alpha_j n} = m_i m_j \cdot 10^{\beta_i + \beta_j + n(\alpha_i + \alpha_j)}$. We conclude that the vector $(x_i \cdot x_j)^{(t)}$ is pseudo-periodic with growth rate $\alpha_i + \alpha_j$. Observe that the mantissa precision increases by at most a factor of 2.

Secondly, we show that if two pseudo-periodic sequences have the same growth rate, then their sum is also pseudo-periodic with the same growth rate. Suppose $x^{(N+Tn)}_i = m_i 10^{\beta_i + \alpha \cdot n}$ and $x^{(N+Tn)}_j = m_j 10^{\beta_j + \alpha \cdot n}$. Observe that $(x_i + x_j)^{(N+Tn)} = m_i 10^{\beta_i + \alpha \cdot n} + m_j 10^{\beta_j + \alpha \cdot n} = (m_i + m_j \cdot 10^{\beta_j - \beta_i})\,10^{\beta_i + \alpha \cdot n}$. Observe that the mantissa precision increases by at most a factor of $10^{|\beta_j - \beta_i|}$.
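The two closure computations can be mirrored on exact mantissa-exponent pairs $(m, e)$ representing $m \cdot 10^e$; a minimal sketch (function names are ours):

```python
def fp_mul(a, b):
    """Product of mantissa-exponent pairs: mantissas multiply and exponents
    add, so growth rates add (the product case of the proof)."""
    (mi, ei), (mj, ej) = a, b
    return (mi * mj, ei + ej)

def fp_add(a, b):
    """Sum of two pairs: rescale both to the smaller exponent, then add
    mantissas (the equal-growth-rate sum case of the proof)."""
    (mi, ei), (mj, ej) = a, b
    e = min(ei, ej)
    return (mi * 10 ** (ei - e) + mj * 10 ** (ej - e), e)
```

Here `fp_mul((3, 2), (2, 1))` gives `(6, 3)`, i.e. $300 \cdot 20 = 6000$, and `fp_add((3, 2), (5, 1))` gives `(35, 1)`, i.e. $300 + 50 = 350$: the mantissa has grown by a factor of at most $10^{|\beta_j - \beta_i|}$, as noted above.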

Let $P(x_1, \dots, x_d) = \sum_{i=1}^{N} c_i Z_i$, where each $Z_i$ is a product of the variables $x_1, \dots, x_d$. Consider each monomial $Z_i$ occurring in $P$: since products preserve pseudo-periodicity, we conclude that $Z_i$ is pseudo-periodic. $P^{(t)}$ is thus a linear combination of these pseudo-periodic vectors. Note that our prior observation does not immediately imply that $P^{(t)}$ is pseudo-periodic, as we required taking the sum of elements with the same growth rate. However, from some point on, we are only interested in those with the maximal growth rate.

Without loss of generality, let $Z_1, \dots, Z_r$ have the maximum growth rate, and $Z_{r+1}, \dots, Z_N$ have strictly smaller growth rates. For every $L \in \mathbb{N}$ there exists $N' \in \mathbb{N}$ such that for all $t > N'$, $\mathrm{exponent}(Z^{(t)}_1) - \mathrm{exponent}(Z^{(t)}_{r+1}) > L$.

Hence there exists $N' \in \mathbb{N}$ such that for all $t > N'$, $\sum_{i=1}^{N} c_i Z_i = \sum_{i=1}^{r} c_i Z_i + \sum_{i=r+1}^{N} c_i Z_i > 0$ if and only if $\sum_{i=1}^{r} c_i Z_i > 0$, because $\bigl|\sum_{i=r+1}^{N} c_i Z_i\bigr| < \bigl|\sum_{i=1}^{r} c_i Z_i\bigr|$ from some point on. Hence $\mathrm{sign}\bigl(\sum_{i=1}^{N} c_i Z^{(t)}_i\bigr) = \mathrm{sign}\bigl(\sum_{i=1}^{r} c_i Z^{(t)}_i\bigr)$.

Thus we restrict our attention to $\sum_{i=1}^{r} c_i Z^{(t)}_i$. Since each of the $Z_i$ for $i \in \{1, \dots, r\}$ has the same growth rate, we know that $\sum_{i=1}^{r} c_i Z^{(t)}_i$ is pseudo-periodic. Since $\mathrm{sign}\bigl(\sum_{i=1}^{r} c_i Z^{(t)}_i\bigr)$ does not depend on the exponent, only on the periodic mantissa, the sign is periodic. The hitting times for $t \leq N$ can be determined exhaustively and included in the finite set of the semi-linear set. ⊓⊔

Acknowledgements Partially funded by DFG grant 389792660 as part of TRR 248 – CPEC, see perspicuous-computing.science. Joël Ouaknine is also affiliated with Keble College, Oxford as emmy.network Fellow. David Purser was partially supported by the ERC grant INFSYS, agreement no. 950398.

## References

1. Abbasi, R., Schiffl, J., Darulova, E., Ulbrich, M., Ahrendt, W.: Deductive verification of floating-point Java programs in KeY. In: Groote, J.F., Larsen, K.G. (eds.) Tools and Algorithms for the Construction and Analysis of Systems - 27th International Conference, TACAS 2021, Part of ETAPS 2021, Part II. Lecture Notes in Computer Science, vol. 12652, pp. 242–261. Springer (2021). https://doi.org/10.1007/978-3-030-72013-1_13


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## Efficient Loop Conditions for Bounded Model Checking Hyperproperties<sup>⋆</sup>

Tzu-Han Hsu<sup>1</sup> , César Sánchez<sup>2</sup> , Sarai Sheinvald<sup>3</sup> , and Borzoo Bonakdarpour<sup>1</sup>(B)

<sup>1</sup> Michigan State University, East Lansing, MI, USA {tzuhan,borzoo}@msu.edu
<sup>2</sup> IMDEA Software Institute, Madrid, Spain cesar.sanchez@imdea.org
<sup>3</sup> Dept. of Software Engineering, Braude College, Karmiel, Israel sarai@braude.ac.il

Abstract. Bounded model checking (BMC) is an effective technique for hunting bugs by incrementally exploring the state space of a system. To reason about infinite traces through a finite structure and to ultimately obtain completeness, BMC incorporates loop conditions that revisit previously observed states. This paper focuses on developing loop conditions for BMC of HyperLTL – a temporal logic for hyperproperties that allows expressing important policies for security and consistency in concurrent systems, etc. Loop conditions for HyperLTL are more complicated than for LTL, as different traces may loop inconsistently in unrelated moments. Existing BMC approaches for HyperLTL only considered linear unrollings without any looping capability, which precludes both finding small infinite traces and obtaining a complete technique. We investigate loop conditions for HyperLTL BMC, for HyperLTL formulas that contain up to one quantifier alternation. We first present a general complete automata-based technique which is based on bounds of maximum unrollings. Then, we introduce alternative simulation-based algorithms that allow exploiting short loops effectively, generating SAT queries whose satisfiability guarantees the outcome of the original model checking problem. We also report an empirical evaluation of the prototype implementation of our BMC techniques using Z3py.

## 1 Introduction

Hyperproperties [13] have been getting increasing attention due to their power to reason about important specifications such as information-flow security policies that require reasoning about the interrelation among different execution traces. HyperLTL [12] is an extension of the linear-time temporal logic LTL [31] that allows quantification over traces and is hence capable of describing hyperproperties. For example, the security policy observational determinism can be specified as

© The Author(s) 2023

<sup>⋆</sup> This research has been partially supported by the United States NSF SaTC Award 2100989, by the Madrid Regional Gov. Project BLOQUES-CM (S2018/TCS-4339), by Project PRODIGY (TED2021-132464B-I00) funded by MCIN/AEI/10.13039/501100011033/ and the EU NextGenerationEU/PRTR, and by a research grant from Nomadic Labs and the Tezos Foundation.

S. Sankaranarayanan and N. Sharygina (Eds.): TACAS 2023, LNCS 13993, pp. 66–84, 2023. https://doi.org/10.1007/978-3-031-30823-9_4

the HyperLTL formula $\forall\pi.\forall\pi'.\ (o_\pi \leftrightarrow o_{\pi'})\ \mathcal{W}\ \neg(i_\pi \leftrightarrow i_{\pi'})$, which specifies that for every pair of traces $\pi$ and $\pi'$, if they agree on the secret input $i$, then their public output $o$ must also be observed to be the same (here '$\mathcal{W}$' denotes the weak until operator).

Several works [14,22] have studied model checking techniques for HyperLTL specifications, which typically reduce this problem to LTL model checking queries on modified systems. More recently, [27] proposed a QBF-based algorithm for the direct application of bounded model checking (BMC) [11] to HyperLTL, and successfully provided a push-button solution to verify or falsify HyperLTL formulas with an arbitrary number of quantifier alternations. However, unlike the classic BMC for LTL, which includes the so-called loop conditions, the algorithm in [27] is limited to (non-looping) linear exploration of paths. The reason is that extending path exploration to include loops when dealing with multiple paths simultaneously is not straightforward. For example, consider the HyperLTL formula $\varphi_1 = \forall\pi.\exists\pi'.\ \square(a_\pi \rightarrow b_{\pi'})$ and two Kripke structures $K_1$ and $K_2$ as follows:

Assume trace $\pi$ ranges over $K_1$ and trace $\pi'$ ranges over $K_2$. Proving $\langle K_1, K_2\rangle \not\models \varphi_1$ can be achieved by finding a finite counterexample (i.e., the path $s_1 s_2 s_3$ from $K_1$). Now consider $\varphi_2 = \forall\pi.\exists\pi'.\ \square(a_\pi \leftrightarrow a_{\pi'})$. It is easy to see that $\langle K_1, K_2\rangle \models \varphi_2$. However, to prove $\langle K_1, K_2\rangle \models \varphi_2$, one has to show the absence of counterexamples on infinite paths, which is impossible with model unrolling in finitely many steps as proposed in [27].

In this paper, we propose efficient loop conditions for BMC of hyperproperties. First, using an automata-based method, we show that lasso-shaped traces are sufficient to prove infinite behaviors of traces within finite exploration. However, this technique requires an unrolling bound that renders it impractical. Instead, our efficient algorithms are based on the notion of simulation [32] between two systems. Simulation is an important tool in verification, as it is used for abstraction, and preserves ACTL$^*$ properties [6,24]. As opposed to more complex properties such as language containment, simulation is a more local property and is easier to check. The main contribution of this paper is the introduction of practical algorithms that achieve the exploration of infinite paths following a simulation-based approach that is capable of relating the states of multiple models with correct successor relations.

We present two different variants of simulation, SIMEA and SIMAE, allowing us to check the satisfaction of ∃∀ and ∀∃ hyperproperties, respectively. These notions circumvent the need to boundlessly unroll traces in both structures and synchronize them. For SIMAE, in order to resolve non-determinism in the first model, we also present a third variant, where we enhance SIMAE by using prophecy variables [1,7]. Prophecy variables allow us to handle cases in which ∀∃ hyperproperties hold despite the lack of a direct simulation. With our simulation-based approach, one can capture infinite behaviors of traces with finite exploration in a simple and concise way. Furthermore, our BMC approach not only model-checks the systems for hyperproperties, but also does so in a way that finds minimal witnesses to the simulation (i.e., by partially exploring the existentially quantified model), which we further demonstrate in our empirical evaluation.

We also design algorithms that generate SAT formulas for each variant (i.e., SIMEA, SIMAE, and SIMAE with prophecies), where the satisfiability of the formulas implies the model checking outcome. We also investigate the practical cases of models with different sizes, leading to the eight categories in Table 1. For example, the


Table 1: Eight categories of HyperLTL formulas with different forms of quantifiers, sizes of models, and different temporal operators.

first row indicates the category of verifying two models of different sizes with the fragment that only allows ∀∃ quantifiers and $\square$ (i.e., the globally temporal operator); ∀small∃big means that the first model is relatively smaller than the second model, and the positive outcome ($\models \forall\exists\varphi$) can be proved by our simulation-based technique SIMAE, while the negative outcome ($\not\models \forall\exists\varphi$) can be easily checked using non-looping unrolling (i.e., [27]). We will show that in certain cases, one can verify a formula without exploring the entire state space of the big model, to achieve efficiency.

We have implemented our algorithms<sup>1</sup> using Z3py, the Z3 [15] API in Python. We demonstrate the efficiency of our algorithm by exploring a subset of the state space of the larger (i.e., big) model. We evaluate the applicability and efficiency on cases including conformance checking for distributed protocol synthesis, model translation, and path planning problems. In summary, we make the following contributions: (1) a bounded model checking algorithm for hyperproperties with loop conditions, (2) three different practical algorithms: SIMEA, SIMAE, and SIMAE with prophecies, and (3) a demonstration of the efficiency and applicability through case studies that cover all eight categories of HyperLTL formulas (see Table 1).

Related Work. Hyperproperties were first introduced by Clarkson and Schneider [13]. HyperLTL was introduced as a temporal logic for hyperproperties in [12]. The first algorithms for model checking HyperLTL were introduced in [22] using alternating automata. Automated reasoning about HyperLTL specifications has received attention in many aspects, including static verification [14,20,21,22] and monitoring [2,8,10,18,19,26,33]. This includes tool support, such as MCHyper [22,14] for model checking, EAHyper [17] and MGHyper [16] for satisfiability checking, and RVHyper [18] for runtime monitoring. However, the aforementioned tools are either limited to HyperLTL formulas without quantifier alternations, or require additional inputs from the user (e.g., manually added strategies [14]).

<sup>1</sup> Available at: https://github.com/TART-MSU/loop_condition_tacas23

Recently, this difficulty of alternating formulas was tackled by the bounded model checker HyperQB [27] using QBF solving. However, HyperQB lacks loop conditions to capture early infinite traces in finite exploration. In this paper, we develop simulation-based algorithms to overcome this limitation. There are alternative approaches to reasoning about infinite traces, such as reasoning about strategies to deal with ∀∃ formulas [14], whose completeness can be obtained by generating a set of prophecy variables [7]. In this work, we capture infinite traces in the BMC approach using simulation. We also build an applicable prototype for model checking HyperLTL formulas on models that contain loops.

## 2 Preliminaries

Kripke structures. A Kripke structure $K$ is a tuple $\langle S, S^0, \delta, \mathrm{AP}, L\rangle$, where $S$ is a set of states, $S^0 \subseteq S$ is a set of initial states, $\delta \subseteq S \times S$ is a total transition relation, and $L : S \to 2^{\mathrm{AP}}$ is a labeling function, which labels each state $s \in S$ with the subset of atomic propositions in $\mathrm{AP}$ that hold in $s$. A path of $K$ is an infinite sequence of states $s(0)s(1)\cdots \in S^\omega$ such that $s(0) \in S^0$ and $(s(i), s(i+1)) \in \delta$ for all $i \geq 0$. A loop in $K$ is a finite path $s(n)s(n+1)\cdots s(\ell)$, for some $0 \leq n \leq \ell$, such that $(s(i), s(i+1)) \in \delta$ for all $n \leq i < \ell$, and $(s(\ell), s(n)) \in \delta$. Note that $n = \ell$ indicates a self-loop on a state. A trace of $K$ is an infinite sequence $t(0)t(1)t(2)\cdots \in \Sigma^\omega$ such that there exists a path $s(0)s(1)\cdots \in S^\omega$ with $t(i) = L(s(i))$ for all $i \geq 0$. We denote by $\mathit{Traces}(K, s)$ the set of all traces of $K$ with paths that start in state $s \in S$. We use $\mathit{Traces}(K)$ as a shorthand for $\bigcup_{s \in S^0} \mathit{Traces}(K, s)$, and $L(K)$ as a shorthand for $\mathit{Traces}(K)$.
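The definitions above can be transcribed directly; the following is an illustrative sketch (class and function names are ours, not the paper's):

```python
from dataclasses import dataclass

@dataclass
class Kripke:
    states: set
    init: set            # S^0, subset of states
    delta: set           # transition relation: set of (s, s') pairs
    labels: dict         # L: state -> frozenset of atomic propositions

def is_loop(k: Kripke, path: list) -> bool:
    """A loop s(n)...s(l): consecutive states are delta-related and the
    last state transitions back to the first (a 1-state path with a
    self-loop also qualifies, matching the n = l case)."""
    consecutive = all((path[i], path[i + 1]) in k.delta
                      for i in range(len(path) - 1))
    return consecutive and (path[-1], path[0]) in k.delta

def trace_of(k: Kripke, path: list) -> list:
    """The trace of a (finite segment of a) path: its sequence of label sets."""
    return [k.labels[s] for s in path]
```

For example, in a two-state structure with transitions $0 \to 1$, $1 \to 0$ and $1 \to 1$, the path $[0, 1]$ is a loop, while $[0]$ is not (no self-loop on state 0).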

Simulation relations. Let $K_A = \langle S_A, S^0_A, \delta_A, \mathrm{AP}_A, L_A\rangle$ and $K_B = \langle S_B, S^0_B, \delta_B, \mathrm{AP}_B, L_B\rangle$ be two Kripke structures. A simulation relation $R$ from $K_A$ to $K_B$ is a relation $R \subseteq S_A \times S_B$ that meets the following conditions:

1. for every $s_A \in S^0_A$ there exists $s_B \in S^0_B$ such that $(s_A, s_B) \in R$;
2. if $(s_A, s_B) \in R$ then $L_A(s_A) = L_B(s_B)$; and
3. if $(s_A, s_B) \in R$ and $(s_A, s'_A) \in \delta_A$, then there exists $s'_B \in S_B$ such that $(s_B, s'_B) \in \delta_B$ and $(s'_A, s'_B) \in R$.
The Temporal Logic HyperLTL. HyperLTL [12] is an extension of the linear-time temporal logic (LTL) for hyperproperties. The syntax of HyperLTL formulas is defined inductively by the following grammar:

$$\begin{aligned} \varphi &::= \exists \pi. \varphi \mid \forall \pi. \varphi \mid \phi\\ \phi &::= \mathtt{true} \mid a\_{\pi} \mid \neg \phi \mid \phi \lor \phi \mid \phi \land \phi \mid \phi \mathcal{U} \,\phi \mid \phi \,\mathcal{R} \,\phi \mid \bigcirc \phi \end{aligned}$$

where $a \in \mathrm{AP}$ is an atomic proposition and $\pi$ is a trace variable from an infinite supply of variables $\mathcal{V}$. The Boolean connectives $\neg$, $\vee$, and $\wedge$ have the usual meaning, $\mathcal{U}$ is the temporal until operator, $\mathcal{R}$ is the temporal release operator, and $\bigcirc$ is the temporal next operator. We also consider other derived Boolean connectives, such as $\rightarrow$ and $\leftrightarrow$, and the derived temporal operators eventually $\Diamond\varphi \equiv \mathtt{true}\ \mathcal{U}\ \varphi$ and globally $\square\varphi \equiv \neg\Diamond\neg\varphi$. A formula is closed (i.e., a sentence) if all trace variables used in the formula are quantified. We assume, without loss of generality, that no trace variable is quantified twice. We use $\mathrm{Vars}(\varphi)$ for the set of trace variables used in formula $\varphi$.

Semantics. An interpretation $\mathcal{T} = \langle T_\pi\rangle_{\pi \in \mathrm{Vars}(\varphi)}$ of a formula $\varphi$ consists of a tuple of sets of traces, with one set $T_\pi$ per trace variable $\pi$ in $\mathrm{Vars}(\varphi)$, denoting the set of traces that $\pi$ ranges over. Note that we allow quantifiers to range over different models, called the multi-model semantics [23,27]<sup>2</sup>. That is, each set of traces comes from a Kripke structure, and we use $\mathcal{K} = \langle K_\pi\rangle_{\pi \in \mathrm{Vars}(\varphi)}$ to denote a family of Kripke structures, so $T_\pi = \mathit{Traces}(K_\pi)$ is the set of traces that $\pi$ can range over, which comes from $K_\pi \in \mathcal{K}$. Abusing notation, we write $\mathcal{T} = \mathit{Traces}(\mathcal{K})$.

The semantics of HyperLTL is defined with respect to a trace assignment, which is a partial map $\Pi : \mathrm{Vars}(\varphi) \rightharpoonup \Sigma^\omega$. The assignment with the empty domain is denoted by $\Pi_\emptyset$. Given a trace assignment $\Pi$, a trace variable $\pi$, and a concrete trace $t \in \Sigma^\omega$, we denote by $\Pi[\pi \mapsto t]$ the assignment that coincides with $\Pi$ everywhere but at $\pi$, which is mapped to trace $t$. The satisfaction of a HyperLTL formula $\varphi$ is a binary relation $\models$ that associates a formula to the models $(\mathcal{T}, \Pi, i)$, where $i \in \mathbb{Z}_{\geq 0}$ is a pointer that indicates the current evaluating position. The semantics is defined as follows:

- $(\mathcal{T}, \Pi, 0) \models \exists\pi.\,\psi$ iff there is a $t \in T_\pi$ such that $(\mathcal{T}, \Pi[\pi \mapsto t], 0) \models \psi$,
- $(\mathcal{T}, \Pi, 0) \models \forall\pi.\,\psi$ iff for all $t \in T_\pi$ it holds that $(\mathcal{T}, \Pi[\pi \mapsto t], 0) \models \psi$,
- $(\mathcal{T}, \Pi, i) \models \mathtt{true}$,
- $(\mathcal{T}, \Pi, i) \models a_\pi$ iff $a \in \Pi(\pi)(i)$,
- $(\mathcal{T}, \Pi, i) \models \neg\psi$ iff $(\mathcal{T}, \Pi, i) \not\models \psi$,
- $(\mathcal{T}, \Pi, i) \models \psi_1 \vee \psi_2$ iff $(\mathcal{T}, \Pi, i) \models \psi_1$ or $(\mathcal{T}, \Pi, i) \models \psi_2$,
- $(\mathcal{T}, \Pi, i) \models \psi_1 \wedge \psi_2$ iff $(\mathcal{T}, \Pi, i) \models \psi_1$ and $(\mathcal{T}, \Pi, i) \models \psi_2$,
- $(\mathcal{T}, \Pi, i) \models \bigcirc\psi$ iff $(\mathcal{T}, \Pi, i+1) \models \psi$,
- $(\mathcal{T}, \Pi, i) \models \psi_1\,\mathcal{U}\,\psi_2$ iff there is a $j \geq i$ for which $(\mathcal{T}, \Pi, j) \models \psi_2$ and for all $k \in [i, j)$, $(\mathcal{T}, \Pi, k) \models \psi_1$,
- $(\mathcal{T}, \Pi, i) \models \psi_1\,\mathcal{R}\,\psi_2$ iff either for all $j \geq i$, $(\mathcal{T}, \Pi, j) \models \psi_2$, or, for some $j \geq i$, $(\mathcal{T}, \Pi, j) \models \psi_1$ and for all $k \in [i, j]$, $(\mathcal{T}, \Pi, k) \models \psi_2$.

We say that an interpretation $\mathcal{T}$ satisfies a sentence $\varphi$, denoted by $\mathcal{T} \models \varphi$, if $(\mathcal{T}, \Pi_\emptyset, 0) \models \varphi$. We say that a family of Kripke structures $\mathcal{K}$ satisfies a sentence $\varphi$, denoted by $\mathcal{K} \models \varphi$, if $\langle \mathit{Traces}(K_\pi)\rangle_{\pi \in \mathrm{Vars}(\varphi)} \models \varphi$. When the same Kripke structure $K$ is used for all path variables, we write $K \models \varphi$.

Definition 1. A nondeterministic Büchi automaton (NBW) is a tuple $\mathcal{A} = \langle\Sigma, Q, Q^0, \delta, F\rangle$, where $\Sigma$ is an alphabet, $Q$ is a nonempty finite set of

<sup>2</sup> In terms of the model checking problem, multi-model and (the conventional) single-model semantics, where all paths are assigned traces from the same Kripke structure [12], are equivalent (see [23,27]).

states, $Q^0 \subseteq Q$ is a set of initial states, $F \subseteq Q$ is a set of accepting states, and $\delta \subseteq Q \times \Sigma \times Q$ is a transition relation.

Given an infinite word $w = \sigma_1\sigma_2\cdots$ over $\Sigma$, a run of $\mathcal{A}$ on $w$ is an infinite sequence of states $r = (q_0, q_1, \dots)$ such that $q_0 \in Q^0$ and $(q_{i-1}, \sigma_i, q_i) \in \delta$ for every $i > 0$. The run is accepting if $r$ visits some state in $F$ infinitely often. We say that $\mathcal{A}$ accepts $w$ if there exists an accepting run of $\mathcal{A}$ on $w$. The language of $\mathcal{A}$, denoted $L(\mathcal{A})$, is the set of all infinite words accepted by $\mathcal{A}$. An NBW $\mathcal{A}$ is called a safety NBW if all of its states are accepting. Every safety LTL formula $\psi$ can be translated into a safety NBW $\mathcal{A}$ over $2^{\mathrm{AP}}$ such that $L(\mathcal{A})$ is the set of all traces over $\mathrm{AP}$ that satisfy $\psi$ [29].

## 3 Adaptation of BMC to HyperLTL on Infinite Traces

There are two main obstacles to extending the BMC approach of [27] to handle infinite traces. First, a trace may have irregular behavior. Second, even traces whose behavior is regular, that is, lasso-shaped, are hard to synchronize, since the lengths of their respective prefixes and lassos need not be equal. For the latter issue, synchronizing two traces whose prefixes and lassos are of lengths $p_1, p_2$ and $l_1, l_2$, respectively, is equivalent to coordinating the same two traces when defining both their prefixes to be of length $\max\{p_1, p_2\}$ and their lassos to be of length $\mathrm{lcm}\{l_1, l_2\}$, where 'lcm' stands for 'least common multiple'. As for the former challenge, we show that restricting the exploration of traces in the models to only consider lasso traces is sound. That is, considering only lasso-shaped traces is equivalent to considering the entire trace set of the models.
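The synchronization step above can be sketched as follows (function names are ours); `expand` materializes a finite window of the lasso word prefix · loopʷ:

```python
from math import gcd

def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)

def synchronize(p1, l1, p2, l2):
    """Common prefix/loop lengths under which two lasso traces can be
    compared position by position: max of prefixes, lcm of loops."""
    return max(p1, p2), lcm(l1, l2)

def expand(prefix, loop, n):
    """First n letters of the lasso word prefix . loop^omega."""
    out = list(prefix)
    while len(out) < n:
        out.extend(loop)
    return out[:n]
```

For example, lassos with prefix/loop lengths $(2, 3)$ and $(1, 4)$ synchronize under common prefix length $2$ and common loop length $\mathrm{lcm}(3, 4) = 12$.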

Let $K = \langle S, S^0, \delta, \mathrm{AP}, L\rangle$ be a Kripke structure. A lasso path of $K$ is a path $s(0)s(1)\dots s(\ell)$ such that $(s(\ell), s(n)) \in \delta$ for some $0 \leq n < \ell$. This path induces a lasso trace (i.e., a lasso) $L(s(0))\cdots L(s(n-1))\,(L(s(n))\cdots L(s(\ell)))^\omega$. Let $\langle K_1, \dots, K_k\rangle$ be a multi-model; we denote the set of lasso paths of $K_i$ by $C_i$, for all $1 \leq i \leq k$, and we use $L(C_i)$ as shorthand for the corresponding set of lasso traces of $K_i$.

Theorem 1. Let $\mathcal{K} = \langle K_1, \dots, K_k\rangle$ be a multi-model, and let $\varphi = Q_1\pi_1.\cdots Q_k\pi_k.\,\psi$ be a HyperLTL formula, both over $\mathrm{AP}$. Then $\mathcal{K} \models \varphi$ iff $\langle C_1, \dots, C_k\rangle \models \varphi$.

Proof. (sketch) For an LTL formula $\psi$ over $\mathrm{AP} \times \{\pi_i\}_{i=1}^{k}$, we denote the translation of $\psi$ to an NBW over $2^{\mathrm{AP} \times \{\pi_i\}_{i=1}^{k}}$ by $\mathcal{A}_\psi$ [34]. Given $\alpha = Q_1\pi_1 \cdots Q_k\pi_k$, where $Q_i \in \{\exists, \forall\}$, we define the satisfaction of $\mathcal{A}_\psi$ by $\mathcal{K}$ w.r.t. $\alpha$, denoted $\mathcal{K} \models (\alpha, \mathcal{A}_\psi)$, in the natural way: $\exists\pi_i$ corresponds to the existence of a path assigned to $\pi_i$ in $K_i$, and dually for $\forall\pi_i$. Then, $\mathcal{K} \models (\alpha, \mathcal{A}_\psi)$ iff the various $k$-assignments of traces of $\mathcal{K}$ to $\{\pi_i\}_{i=1}^{k}$ according to $\alpha$ are accepted by $\mathcal{A}_\psi$, which holds iff $\mathcal{K} \models \varphi$.

For a model $K$, we denote by $K \cap^k \mathcal{A}_\psi$ the intersection of $K$ and $\mathcal{A}_\psi$ w.r.t. $\mathrm{AP} \times \{\pi_k\}$, taking the projection over $\mathrm{AP} \times \{\pi_i\}_{i=1}^{k-1}$. Thus, $L(K \cap^k \mathcal{A}_\psi)$ is the set of all $(k-1)$-words that have an extension (i.e., $\exists$) by a word in $L(K)$ to a $k$-word in $L(\mathcal{A}_\psi)$. Oppositely, $L(\overline{K \cap^k \overline{\mathcal{A}_\psi}})$ is the set of all $(k-1)$-words whose every extension (i.e., $\forall$) by a word in $L(K)$ yields a $k$-word in $L(\mathcal{A}_\psi)$.

We first construct NBWs $\mathcal{A}_2, \dots, \mathcal{A}_{k-1}, \mathcal{A}_k$, such that for every $1 < i < k$, we have $\langle K_1, \dots, K_i\rangle \models (\alpha_i, \mathcal{A}_{i+1})$ iff $\mathcal{K} \models (\alpha, \mathcal{A}_\psi)$, where $\alpha_i = Q_1\pi_1 \dots Q_i\pi_i$.

For $i = k$: if $Q_k = \exists$, then $\mathcal{A}_k = K_k \cap^k \mathcal{A}_\psi$; otherwise, if $Q_k = \forall$, $\mathcal{A}_k = \overline{K_k \cap^k \overline{\mathcal{A}_\psi}}$. For $1 < i < k$: if $Q_i = \exists$ then $\mathcal{A}_i = K_i \cap^i \mathcal{A}_{i+1}$; otherwise, if $Q_i = \forall$, $\mathcal{A}_i = \overline{K_i \cap^i \overline{\mathcal{A}_{i+1}}}$. Then, for every $1 < i < k$, we have $\langle K_1, \dots, K_i\rangle \models (\alpha_i, \mathcal{A}_{i+1})$ iff $\langle K_1, \dots, K_k\rangle \models \varphi$.

We now prove by induction on $k$ that $\mathcal{K} \models \varphi$ iff $\langle C_1, \dots, C_k\rangle \models \varphi$. For $k = 1$, it holds that $\mathcal{K} \models \varphi$ iff $K_1 \models (Q_1\pi_1, \mathcal{A}_2)$. If $Q_1 = \forall$, then $K_1 \models (Q_1\pi_1, \mathcal{A}_2)$ iff $K_1 \cap \overline{\mathcal{A}_2} = \emptyset$. If $Q_1 = \exists$, then $K_1 \models (Q_1\pi_1, \mathcal{A}_2)$ iff $K_1 \cap \mathcal{A}_2 \neq \emptyset$. In both cases, a lasso witness to the non-emptiness exists. For $1 < i < k$, we prove that $\langle C_1, \dots, C_i, K_{i+1}\rangle \models (\alpha_{i+1}, \mathcal{A}_{i+2})$ iff $\langle C_1, \dots, C_i, C_{i+1}\rangle \models (\alpha_{i+1}, \mathcal{A}_{i+2})$. If $Q_i = \forall$, then the first direction simply holds because $L(C_{i+1}) \subseteq L(K_{i+1})$. For the second direction, every extension of $c_1, c_2, \dots, c_i$ (i.e., lassos in $C_1, C_2, \dots, C_i$) by a path $\tau$ in $K_{i+1}$ is in $L(\mathcal{A}_{i+2})$. Indeed, otherwise we could extract a lasso $c_{i+1}$ such that $c_1, c_2, \dots, c_{i+1}$ is not in $L(\mathcal{A}_{i+2})$, a contradiction. If $Q_i = \exists$, then $L(C_{i+1}) \subseteq L(K_{i+1})$ implies the second direction. For the first direction, we can extract a lasso $c_{i+1} \in L(C_{i+1})$ such that $\langle c_1, c_2, \dots, c_i, c_{i+1}\rangle \in L(\mathcal{A}_{i+2})$. ⊓⊔

One can use Theorem 1 and the observations above to construct a sound and complete BMC algorithm for both ∀∃ and ∃∀ hyperproperties. Indeed, consider a multi-model $\langle K_1, K_2\rangle$ and a hyperproperty $\varphi = \forall\pi.\exists\pi'.\,\psi$. Such a BMC algorithm would try to verify $\langle K_1, K_2\rangle \models \varphi$ directly, or try to prove $\langle K_1, K_2\rangle \models \neg\varphi$. In both cases, a run may find a short lasso example for the model under $\exists$ ($K_2$ in the former case and $K_1$ in the latter), leading to a shorter run. However, in both cases, the model under $\forall$ would have to be explored to the maximal lasso length implied by Theorem 1, which is doubly exponential. Therefore, this naive approach would be highly inefficient.

## 4 Simulation-Based BMC Algorithms for HyperLTL

We now introduce efficient simulation-based BMC algorithms for verifying hyperproperties of the forms $\forall\pi.\exists\pi'.\,\square\mathit{Pred}$ and $\exists\pi.\forall\pi'.\,\square\mathit{Pred}$, where $\mathit{Pred}$ is a relational predicate (a predicate over a pair of states). The key observation is that simulation naturally induces the exploration of infinite traces without the need to explicitly unroll the structures, and without needing to synchronize the indices of the symbolic variables in both traces. Moreover, in some cases our algorithms allow us to only partially explore the state space of a Kripke structure and still give a conclusive answer efficiently.

Let $K_P = \langle S_P, S^0_P, \delta_P, \mathrm{AP}_P, L_P\rangle$ and $K_Q = \langle S_Q, S^0_Q, \delta_Q, \mathrm{AP}_Q, L_Q\rangle$ be two Kripke structures, and consider a hyperproperty of the form $\forall\pi.\exists\pi'.\,\square\mathit{Pred}$. Suppose that there exists a simulation from $K_P$ to $K_Q$. Then every trace in $K_P$ is embodied in $K_Q$. Indeed, we can show by induction that for every trace $t_p = s_p(1)s_p(2)\dots$ in $K_P$, there exists a trace $t_q = s_q(1)s_q(2)\dots$ in $K_Q$ such that $s_q(i)$ simulates $s_p(i)$ for every $i \geq 1$; therefore, $t_p$ and $t_q$ are equally labeled. We generalize the labeling constraint in the definition of standard simulation by requiring, given $\mathit{Pred}$, that if $(s_p, s_q)$ is in the simulation relation, then $(s_p, s_q) \models \mathit{Pred}$. We denote this generalized simulation by SIMAE. Following similar considerations, we now have that for every trace $t_p$ in $K_P$, there exists a trace $t_q$ in $K_Q$ such that $(t_p, t_q) \models \square\mathit{Pred}$. Therefore, the following result holds:

Lemma 1. Let K_P and K_Q be Kripke structures, and let φ = ∀π.∃π′. □Pred be a HyperLTL formula. If there exists a SIM_AE from K_P to K_Q, then ⟨K_P, K_Q⟩ |= φ.
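To make Lemma 1 concrete, the sketch below computes the largest Pred-constrained simulation on small explicit-state structures by greatest-fixpoint refinement. This is illustrative only (the algorithms of this paper work symbolically via SAT), and the dictionary-based Kripke representation (`states`, `init`, `succ`, `label`) is our own convention.

```python
def sim_ae(KP, KQ, pred):
    """Largest relation R such that (sp, sq) in R implies
    pred(L_P(sp), L_Q(sq)) and every successor of sp is matched
    by some successor of sq that stays in R.
    KP, KQ: dicts with keys 'states', 'init', 'succ', 'label'."""
    # Start from all Pred-compatible pairs, then refine.
    R = {(sp, sq) for sp in KP['states'] for sq in KQ['states']
         if pred(KP['label'][sp], KQ['label'][sq])}
    changed = True
    while changed:
        changed = False
        for (sp, sq) in list(R):
            # (sp, sq) survives only if each move of sp can be answered.
            if not all(any((tp, tq) in R for tq in KQ['succ'][sq])
                       for tp in KP['succ'][sp]):
                R.discard((sp, sq))
                changed = True
    return R

def holds_forall_exists(KP, KQ, pred):
    # Lemma 1: a SIM_AE covering every initial K_P state witnesses
    # <K_P, K_Q> |= forall pi. exists pi'. []Pred.
    R = sim_ae(KP, KQ, pred)
    return all(any((sp, sq) in R for sq in KQ['init'])
               for sp in KP['init'])
```

The refinement loop is the standard greatest-fixpoint computation of simulation, with Pred taking the place of the usual label-equality constraint.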

We now turn to properties of the type ∃π.∀π′. □Pred. In this case, we must find a single trace in K_P that matches every trace in K_Q. Notice that SIM_AE (in the other direction) does not suffice, since it is not guaranteed that the same trace in K_P is used to match all traces in K_Q. However, according to Theorem 1, it is guaranteed that if ⟨K_P, K_Q⟩ |= ∃π.∀π′. □Pred, then there exists such a single lasso trace t_p in K_P as the witness of the satisfaction. We therefore define a second notion of simulation, denoted SIM_EA, as follows. Let t_p = s_p(1)s_p(2)...s_p(n)...s_p(ℓ) be a lasso trace in K_P (where s_p(ℓ) closes back to s_p(n), that is, (s_p(ℓ), s_p(n)) ∈ δ_P). A relation R from t_p to K_Q is a SIM_EA from t_p to K_Q if the following holds:


If such a lasso trace t_p exists, then we say that there exists a SIM_EA from K_P to K_Q. Notice that the third requirement in fact unrolls K_Q in a way that guarantees that for every trace t_q in K_Q, it holds that (t_p, t_q) |= □Pred. Therefore, the following result holds:

Lemma 2. Let K_P and K_Q be Kripke structures, and let φ = ∃π.∀π′. □Pred. If there exists a SIM_EA from K_P to K_Q, then ⟨K_P, K_Q⟩ |= φ.

Lemmas 1 and 2 enable sound algorithms for model-checking ∀π.∃π′. □Pred and ∃π.∀π′. □Pred hyperproperties with loop conditions. To check the former, check whether there exists a SIM_AE from K_P to K_Q; to check the latter, check for a lasso trace t_p in K_P and a SIM_EA from t_p to K_Q. Based on these ideas, we now introduce two SAT-based BMC algorithms.

For ∀∃ hyperproperties, we not only check for the existence of a SIM_AE, but also iteratively seek a small subset of S_Q that suffices to simulate all states of S_P. While finding a SIM_AE, as for standard simulation, is polynomial, the problem of finding a simulation with a bounded number of K_Q states is NP-complete (see [28] for details). This allows us to efficiently handle instances in which K_Q is large. Moreover, in Subsection 4.3 we introduce the use of prophecy variables, allowing us to overcome cases in which the models satisfy the property but no SIM_AE exists.

For ∃∀ hyperproperties, we search for a SIM_EA by seeking a lasso trace t_p in K_P whose length increases with every iteration, similarly to standard BMC techniques for LTL. Of course, in our case, t_p must be matched with the states of K_Q in a way that ensures a SIM_EA. In the worst case, the length of t_p may be doubly exponential in the sizes of the systems. However, as our experimental results show, in the case of satisfaction the process can terminate much sooner.

We now describe our BMC algorithms and our SAT encodings in detail. First, we fix the unrolling depth of K_P to n and of K_Q to k. To encode states of K_P we allocate a family of Boolean variables {x_i}_{i=1}^{n}. Similarly, we allocate {y_j}_{j=1}^{k} to represent the states of K_Q. Additionally, we encode the simulation relation T by creating n×k Boolean variables {sim_ij}_{i=1,j=1}^{n,k} such that sim_ij holds if and only if T(p_i, q_j). We now present the three variations of the encoding: (1) EA-simulation (SIM_EA), (2) AE-simulation (SIM_AE), and (3) a special variation in which we enrich the AE-simulation with prophecies.

### 4.1 Encodings for EA-Simulation

The goal of this encoding is to find a lasso path t_p in K_P that guarantees that there exists a SIM_EA to K_Q. Note that the set of states that t_p uses may be much smaller than the whole of K_P, while the state space of K_Q must be explored exhaustively. We force x_0 to be an initial state of K_P and x_{i+1} to follow x_i for every i we use; for K_Q, however, we let the solver fill each y_j freely and add constraints<sup>3</sup> for the full exploration of K_Q.

• All states are legal states. The solver must only search legal encodings of states of K_P and K_Q (we use K_P(x_i) to represent the combinations of values that represent a legal state in S_P, and similarly K_Q(y_j) for S_Q):

$$\bigwedge\_{i=1}^{n} K\_P(x\_i) \land \bigwedge\_{j=1}^{k} K\_Q(y\_j) \tag{1}$$

• Exhaustive exploration of K_Q. We require that two different indices y_j and y_r represent two different states in K_Q, so if k = |K_Q| then all states are represented, where y_j ≠ y_r captures that some bit distinguishes the states encoded by j and r (note that the validity of states is implied by (1)):

$$\bigwedge\_{j \neq r} (K\_Q(y\_j) \land K\_Q(y\_r)) \to (y\_j \neq y\_r) \tag{2}$$

• The initial S_P^0 state simulates all initial S_Q^0 states. State x_0 is an initial state of K_P and simulates all initial states of K_Q (we use I_P(x_0) to represent a legal initial state of K_P and I_Q(y_j) for S_Q^0 of K_Q):

$$I\_P(x\_0) \land \left(\bigwedge\_{j=1}^k I\_Q(y\_j) \to T(x\_0, y\_j)\right) \tag{3}$$

<sup>3</sup> An alternative is to fix an enumeration of the states of K_Q and force the assignment of y_0 . . . according to this enumeration instead of constraining a symbolic encoding, but the explanation of the symbolic algorithm above is simpler.

• Successors in K_Q are simulated by successors in K_P. We first introduce the following formula succ_T(x, x′) to capture one step of the simulation, that is, x′ follows x, and for all y, if T(x, y) then x′ simulates all successors of y (we use δ_Q(y, y′) to represent that (y, y′) ∈ δ_Q of K_Q and, similarly, δ_P(x, x′) for (x, x′) ∈ δ_P of K_P):

$$\operatorname{succ}\_T(x, x') \stackrel{\text{def}}{=} \bigwedge\_{y=y\_1}^{y\_k} T(x, y) \to \left(\bigwedge\_{y'=y\_1}^{y\_k} \delta\_Q(y, y') \to T(x', y')\right),$$

We can then define that x_{i+1} follows x_i:

$$\bigwedge\_{i=1}^{n-1} \left[ \delta\_P(x\_i, x\_{i+1}) \land succ\_T(x\_i, x\_{i+1}) \right] \tag{4}$$

And x_n has a jump-back to a previously seen state:

$$\bigvee\_{i=1}^{n} \left[ \delta\_P(x\_n, x\_i) \land succ\_T(x\_n, x\_i) \right] \tag{5}$$

• Relational state predicates are fulfilled by the simulation. Every pair related by the simulation satisfies the relational predicate, defined as a function Pred of two sets of labels (we use L_Q(y) to represent the set of labels of the y-encoded state in K_Q and, similarly, L_P(x) for the x-encoded state in K_P):

$$\bigwedge\_{i=1}^{n} \bigwedge\_{j=1}^{k} T(x\_i, y\_j) \to \mathsf{Pred}(L\_P(x\_i), L\_Q(y\_j))\tag{6}$$

We write φ_EA^{n,k} for the SAT formula that results from conjoining (1)-(6) for bounds n and k. If φ_EA^{n,k} is satisfiable, then there exists a SIM_EA from K_P to K_Q.
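The encoding can be mimicked in explicit-state form: enumerate lasso traces of K_P and, for each, build the smallest relation T forced by the seeding and propagation constraints, accepting when every forced pair satisfies Pred. The sketch below is a brute-force analogue of (1)-(6) under an assumed dictionary-based Kripke representation, not the paper's SAT encoding (constraints (1)-(2) are automatic in explicit-state form).

```python
def check_ea(KP, KQ, pred, n):
    """Enumerate lasso traces of K_P up to length n; accept if one of
    them admits a SIM_EA into K_Q. KP, KQ: dicts with keys 'states',
    'init', 'succ', 'label' (an illustrative representation)."""
    def ok(path, back):
        # Smallest T consistent with the constraints: seed position 0
        # with all initial K_Q states (constraint (3)), then propagate
        # T(x, y) -> T(x_next, y') for every successor y' of y,
        # mirroring succ_T in constraints (4)-(5).
        T = [set() for _ in path]
        T[0] = set(KQ['init'])
        nxt = list(range(1, len(path))) + [back]  # lasso successor index
        for _ in range(len(path) * len(KQ['states']) + 1):
            for i in range(len(path)):
                for y in list(T[i]):
                    T[nxt[i]] |= KQ['succ'][y]
        # Constraint (6): all related pairs satisfy Pred.
        return all(pred(KP['label'][x], KQ['label'][y])
                   for i, x in enumerate(path) for y in T[i])
    def lassos(path):
        for back, x in enumerate(path):        # close the loop back to x
            if x in KP['succ'][path[-1]]:
                yield path, back
        if len(path) < n:
            for x2 in KP['succ'][path[-1]]:
                yield from lassos(path + [x2])
    return any(ok(list(p), b)
               for x0 in KP['init'] for (p, b) in lassos([x0]))
```

Taking the smallest propagated T is sound here because the constraints only force memberships; any extra pair could only violate Pred.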

### 4.2 Encodings for AE-Simulation

Our goal now is to find a set of states S′_Q ⊆ S_Q that is able to simulate all states of K_P. Therefore, as in the previous case, the state space of K_P, corresponding to the ∀ quantifier, is explored exhaustively, so n = |K_P|, while k, the number of K_Q states considered, increases in every iteration. As we have explained, this allows finding a small subset of states of K_Q that suffices to simulate all states of K_P (note that here we guarantee soundness but not necessarily completeness, as further explained in Section 4.3).

• All states in the simulation are legal states. Again, every state guessed in the simulation is a legal state of K_P or K_Q:

$$\bigwedge\_{i=1}^{n} K\_P(x\_i) \land \bigwedge\_{j=1}^{k} K\_Q(y\_j) \tag{1'}$$

• K_P is exhaustively explored. Every two different indices in the states of K_P represent different states<sup>4</sup>:

$$\bigwedge\_{i \neq r} (K\_P(x\_i) \land K\_P(x\_r)) \to (x\_i \neq x\_r) \tag{2'}$$

• All initial states in K_P must match some initial state in K_Q. Note that, contrary to the ∃∀ case, here the initial state in K_Q may be different for each initial state in K_P:

$$\bigwedge\_{i=1}^{n} \bigvee\_{j=1}^{k} I\_P(x\_i) \to \left( I\_Q(y\_j) \land T(x\_i, y\_j) \right) \tag{3'}$$

• For every pair in the simulation, each successor in K_P must match some successor in K_Q. For each (x_i, y_j) in the simulation, every successor state of x_i has a matching successor state of y_j:

$$\bigwedge\_{i=1}^{n} \bigwedge\_{t=1}^{n} \delta\_P(x\_i, x\_t) \to \bigwedge\_{j=1}^{k} \left[ T(x\_i, y\_j) \to \bigvee\_{r=1}^{k} \left( \delta\_Q(y\_j, y\_r) \land T(x\_t, y\_r) \right) \right] \tag{4'}$$

• Relational state predicates are fulfilled. Similarly, all pairs of states in the simulation must respect the relational predicate Pred:

$$\bigwedge\_{i=1}^{n} \bigwedge\_{j=1}^{k} T(x\_i, y\_j) \to \mathsf{Pred}(L\_P(x\_i), L\_Q(y\_j))\tag{5'}$$

We now write φ_AE^{n,k} for the SAT formula that results from conjoining (1′)-(5′) for bounds n and k. If φ_AE^{n,k} is satisfiable, then there exists a SIM_AE from K_P to K_Q.
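As an explicit-state illustration of the iterative search for a small subset S′_Q, the sketch below tries subsets of K_Q's states in increasing size k and checks analogues of constraints (3′)-(5′) by fixpoint refinement. It is a brute-force stand-in for the SAT encoding, again with our own dictionary-based Kripke representation.

```python
from itertools import combinations

def sim_into_subset(KP, KQ, Qsub, pred):
    """Does the largest Pred-constrained simulation from K_P into the
    subset Qsub of K_Q's states cover all initial K_P states?"""
    # Start from Pred-compatible pairs (constraint (5')), then refine
    # until the successor-matching condition (4') holds.
    R = {(p, q) for p in KP['states'] for q in Qsub
         if pred(KP['label'][p], KQ['label'][q])}
    changed = True
    while changed:
        changed = False
        for (p, q) in list(R):
            if not all(any((p2, q2) in R
                           for q2 in KQ['succ'][q] if q2 in Qsub)
                       for p2 in KP['succ'][p]):
                R.discard((p, q))
                changed = True
    # Constraint (3'): every initial K_P state matched by some
    # initial K_Q state in the subset.
    return all(any((p, q) in R for q in KQ['init'] if q in Qsub)
               for p in KP['init'])

def smallest_simulating_subset(KP, KQ, pred):
    # Mirror the iterative deepening on k: try ever-larger subsets S'_Q.
    for k in range(1, len(KQ['states']) + 1):
        for Qsub in combinations(sorted(KQ['states']), k):
            if sim_into_subset(KP, KQ, set(Qsub), pred):
                return set(Qsub)
    return None
```

The subset enumeration reflects the NP-completeness of bounded-size simulation; the SAT solver performs this search symbolically.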

### 4.3 Encodings for AE-Simulation with Prophecies

The AE-simulation encoding introduced in Section 4.2 is sound but not complete (i.e., it may be that the property is satisfied, yet no simulation exists). For example, when the system under the ∀ quantifier is non-deterministic, the simulation is required to immediately match the successor of the ∃ path without inspecting the future of the ∀ path. In this section, we enrich our encodings with prophecies to resolve such cases, which takes us one step towards completeness. We illustrate with the following example.

Example 1. Consider Kripke structures K_1 and K_2 from Section 1, and the HyperLTL formula φ_2 = ∀π.∃π′. (a_π ↔ a_π′). It is easy to see that the two models satisfy φ_2, since mapping the sequence of states (s_1 s_2 s_3) to (q_1 q_2 q_4) and (s_1 s_2 s_4) to (q_1 q_3 q_5) guarantees that the matched paths satisfy (a_π ↔ a_π′). However, the technique of Section 4.2 cannot differentiate the occurrences of s_2 in the two different cases. ⊓⊔

<sup>4</sup> As in the previous case, we could fix an enumeration of the states of S_P and fix x_0 x_1 . . . to be the states according to the enumeration.

Fig. 1: Prophecy automaton for a (left) and its composition with K_1 (right).

To solve this, we incorporate the notion of prophecies into our setting. Prophecies have been proposed as a method to aid in the verification of hyperliveness [14] (see [7] for a systematic method to construct prophecies). For simplicity, we restrict ourselves here to prophecies expressed as safety automata. A safety prophecy over AP is a Kripke structure U = ⟨S, S^0, δ, AP, L⟩ such that Traces(U) = (2^AP)^ω. The product K × U of a Kripke structure K with a prophecy U preserves the language of K (since the language of U is universal). Recall that in the construction of the product, states (s, u) ∈ K × U that have incompatible labels are removed. The direct product can then be easily post-processed by repeatedly removing dead states, resulting in a Kripke structure K′ whose language is Traces(K′) = Traces(K). Note that there may be multiple states in K′ that correspond to the same state in K, under different prophecies. The prophecy-enriched Kripke structure can be passed directly to the method of Section 4.2, so the solver can search for a SIM_AE that takes the value of the prophecy into account.
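The product-and-prune construction can be sketched as follows. The `ap` field and the label-compatibility convention (a pair is compatible when the K state and the U state agree on the prophecy's propositions) are our assumptions about details the text leaves implicit.

```python
def prophecy_product(K, U):
    """Product of Kripke structure K with a prophecy automaton U,
    followed by repeated removal of dead states. Structures are dicts
    with keys 'states', 'init', 'succ', 'label'; U additionally carries
    its alphabet in U['ap'] (an illustrative convention)."""
    # Keep only label-compatible pairs.
    states = {(s, u) for s in K['states'] for u in U['states']
              if K['label'][s] & U['ap'] == U['label'][u]}
    succ = {(s, u): {(s2, u2) for s2 in K['succ'][s]
                     for u2 in U['succ'][u] if (s2, u2) in states}
            for (s, u) in states}
    # Repeatedly remove dead states (states with no remaining successor).
    while True:
        dead = {x for x in states if not (succ[x] & states)}
        if not dead:
            break
        states -= dead
    return {'states': states,
            'init': {(s, u) for (s, u) in states
                     if s in K['init'] and u in U['init']},
            'succ': {x: succ[x] & states for x in states},
            'label': {(s, u): K['label'][s] for (s, u) in states}}
```

Each product state keeps the K-label, so the language is preserved while a state of K may appear in several copies, one per compatible prophecy state.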

Example 2. Consider the prophecy automaton shown in Fig. 1 (left), in which all states are initial. Note that for every state, either all of its successors are labeled with a or none are, and the same holds for the successors of its successors. In other words, this structure encodes the prophecy a. The product K′_1 of K_1 with the prophecy automaton U for a is shown in Fig. 1 (right). Our method can now show that ⟨K′_1, K_2⟩ |= φ_2, since it can distinguish the two copies of s_1 (one satisfies a and is mapped to (q_1 q_2 q_4), while the other is mapped to (q_1 q_3 q_5)). ⊓⊔

## 5 Implementation and Experiments

We have implemented our algorithms using the SAT solver Z3 through its Python API Z3Py [15]. The SAT formulas introduced in Section 4 are encoded in the two scripts simEA.py and simAE.py, which find simulation relations for the SIM_EA and SIM_AE cases, respectively. We evaluate our algorithms on a set of experiments that includes all forms of quantifiers with different model sizes, as presented earlier in Table 1. Our simulation algorithms benefit the most in cases of the form ∀small ∃big: when the second model is substantially larger than the first, SIM_AE is able to prove that a ∀∃ hyperproperty holds by exploring only a subset of the second model. In this section, besides ∀small ∃big cases, we also investigate multiple cases in each category of Table 1 to demonstrate the generality and applicability of our algorithms. All case studies were run on a MacBook Pro with an Apple M1 Max chip and 64 GB of memory.

### 5.1 Case Studies and Empirical Evaluation

Conformance in Scenario-based Programming. In scenario-based programming, scenarios provide a big picture of the desired behaviors of a program and are often used in the context of program synthesis or code generation. To be considered correct, a synthesized program should obey what is specified in the given set of scenarios; that is, the program conforms with the scenarios. The conformance check between the scenarios and the synthesized program can be specified as a ∀∃ hyperproperty:

$$
\varphi\_{\mathsf{conf}} = \forall \pi. \exists \pi'. \bigwedge\_{p \in \mathsf{AP}} \Box \ (p\_{\pi} \leftrightarrow p\_{\pi'}),
$$

where π ranges over the scenario model and π′ over the synthesized program. That is, for every possible run of the scenarios, there must exist a run of the program such that their behaviors always match.

We look into the case of synthesizing an Alternating Bit Protocol (ABP) from four given scenarios, inspired by [3]. ABP is a networking protocol that guarantees reliable message transmission when message loss or data duplication are possible. The protocol has two parties, sender and receiver, which can take three different actions: send, receive, and wait. Each action also specifies which message is currently transmitted: either a packet or an acknowledgment (see [3] for more details). A correctly synthesized protocol should not only have complete functionality but also include all scenarios. That is, for every trace that appears in some scenario, there must exist a corresponding trace in the synthesized protocol. By finding a SIM_AE between the scenarios and the synthesized protocol, we can prove the conformance specified by φ_conf. Note that the scenarios are often much smaller than the actual synthesized protocol, so this case falls into the ∀small ∃big category of Table 1. We consider two variations: a correct and an incorrect ABP (one that cannot handle packet loss). Our algorithm successfully identifies a SIM_AE that satisfies φ_conf for the correct ABP, and returns UNSAT for the incorrect protocol, since the packet-loss scenario cannot be simulated.

Verification of Model Translation. It is often the case in model translation (e.g., compilation) that reasoning solely about the source program does not provide guarantees about the desirable behaviors of the target executable code. Since program verification is expensive compared with repeatedly checking the target, alternative approaches such as certificate translation [4] are often preferred. Certificate translation takes as input a high-level program (source) with a given specification, and computes a set of verification conditions (certificates) for the low-level executable code (target) to prove that a model translation is safe. However, this technique still requires extra effort to map the certificates to a target language, and the size of the generated certificates might explode quickly (see [4] for details). We show that our simulation algorithm can directly show the correctness of a model translation more efficiently by investigating the source and target with the same formula φ_conf used for ABP. That is, the specifications of the source runs π are always preserved in some target runs π′, which implies a correct model translation. Since translating a model into executable code entails adding extra instructions, such as writing to registers, this case also falls into the ∀small ∃big category of Table 1.

We investigate a program from [4] that performs matrix multiplication (MM). When compiled, the program is translated from high-level code (C) to low-level code (RTL, Register Transfer Level), which contains extra steps to read from and write to memory. Specifications are triples ⟨Pre, annot, Post⟩, where Pre and Post are assertions and annot is a partial function from labels to assertions (see [4] for detailed explanations). The goal is to make sure that the translation does not violate the originally verified specification. In our framework, instead of translating the certificates, we find a simulation that satisfies φ_conf, proving that the translated code also satisfies the specification. We again investigate two variations, a correct translation and an incorrect one, and our algorithm returns SAT (i.e., finds a correct SIM_AE) in the former case and UNSAT in the latter.

Compiler Optimization. Secure compiler optimization aims at preserving the input-output behavior between an original implementation and a target program after applying optimization techniques, including security policies. The conformance between source and target programs guarantees that the optimizing procedure does not introduce vulnerabilities such as information leakage. Furthermore, optimization is often not uniform for the same source, because one might compile the source to multiple different targets with different optimization techniques. As a result, an efficient way to check the behavioral equivalence between source and target provides a correctness guarantee for the compiler optimization.

Fig. 2: The common branch factorization example [30].

Applying optimization usually results in a smaller program. For instance, common branch factorization (CBF) finds common operations in an if-then-else structure and moves them outside of the conditional, so that such an operation is only executed once. As a result, for these optimization techniques, checking the conformance of the source and target falls into the ∀big ∃small category. That is, given two programs, a source (big) and a target (small), we check the following formula:

$$
\varphi\_{\mathsf{sc}} = \forall \pi. \exists \pi'. \ (\mathsf{in}\_{\pi} \leftrightarrow \mathsf{in}\_{\pi'}) \to \square \ (\mathsf{out}\_{\pi} \leftrightarrow \mathsf{out}\_{\pi'}).
$$

In this case study we investigate the CBF strategy using the example in Figure 2, inspired by [30]. We consider two kinds of optimized programs for this strategy: a correct optimization, and one containing bugs that violate the original behavior. For the correct version, our algorithm successfully discovered a simulation relation between the source and target, and the simulation uses a smaller subset of states of the second model (i.e., |S′_Q| < |S_Q|). For the incorrect version, we received UNSAT.

Robust Path Planning. In robotic planning, robust planning (RP) refers to a path that is able to consistently complete a mission without being thwarted by uncertainty in the environment (e.g., adversaries). For instance, in the 2-D plane in Fig. 3, an agent is trying to go from the starting point (blue grid) to the goal position (green grid). The plane also contains three adversaries on the three corners other than the starting point (red-framed grids); the adversaries move trying to catch the agent but can only move in one direction (e.g., clockwise).

Fig. 3: A robust path.

This is a ∃small ∀big setting, since the adversaries may have several ways to cooperate in attempting to catch the agent. We formulate this planning problem as follows:

$$
\varphi\_{\mathsf{rp}} = \exists \pi. \forall \pi'. \square \ (\mathsf{pos}\_{\pi} \not\leftrightarrow \mathsf{pos}\_{\pi'}).
$$

That is, there exists a robust path for the agent to safely reach the goal regardless of all the ways the adversaries could move. We consider two scenarios: one in which the agent can form a robust path, and one in which it cannot. Our algorithm successfully returns SAT for the case in which the agent can form a robust path, and UNSAT for the case in which a robust path is impossible to find.

Plan Synthesis. The goal of plan synthesis (PS), synthesizing a single comprehensive plan that simultaneously satisfies all given small requirements, has wide application in planning problems. We take the well-known toy example of the wolf, goat, and cabbage puzzle<sup>5</sup> as a representative case. The problem is as follows: a farmer needs to cross a river by boat with a wolf, a goat, and a cabbage, but can only bring one item with him onto the boat each time. In addition, the wolf would eat the goat, and the goat would eat the cabbage, if they are left unattended. The goal is to find a plan that allows the farmer to successfully cross the river with all three items safe. A plan requires the farmer to go back and forth with the boat, carrying different items in the possible ways, while all small requirements (i.e., the constraints among the items) remain satisfied at all times. In this example, the overall plan is a big model, while the requirements form a much smaller automaton. Hence, it is a ∃big ∀small problem that can be specified with the following formula:

$$
\varphi\_{\mathsf{ps}} = \exists \pi. \forall \pi'. \square \ (\mathsf{action}\_{\pi} \not\leftrightarrow \mathsf{violation}\_{\pi'}).
$$

<sup>5</sup> https://en.wikipedia.org/wiki/Wolf,\_goat\_and\_cabbage\_problem


Table 2: Summary of our case studies. The outcomes with simulation discovered show how our algorithms find a smaller subset for either K_P or K_Q.

### 5.2 Analysis and Discussion

The summary of our empirical evaluation is presented in Table 2. For the ∀∃ cases, our algorithm successfully finds a set with |S′_Q| < |S_Q| that satisfies the properties in the ABP and CBF cases. Note that in the MM case no small subset is found, since we manually add extra padding to the first model to align the lengths of both traces. We note that handling this instance without padding requires asynchronicity, a much more difficult problem, which we leave for future work. For the ∃∀ cases, we are able to find a subset of S_P that forms a single lasso path simulating all runs of S_Q, for both the RP and GCW cases. We emphasize that previous BMC techniques (i.e., HyperQB) cannot handle most of the cases in Table 2 due to the lack of loop conditions.

## 6 Conclusion and Future Work

We introduced efficient loop conditions for bounded model checking of fragments of HyperLTL. We proved that considering only lasso-shaped traces is equivalent to considering the entire trace set of the models, and proposed two simulation-based algorithms, SIM_EA and SIM_AE, to realize infinite reasoning with finite exploration for HyperLTL formulas. To handle non-determinism in the latter case, we combine the models with prophecy automata to provide the (local) simulations with enough information to select the right move for the inner ∃ path. Our algorithms are implemented using Z3Py. We have evaluated their effectiveness and efficiency, with successful verification results for a rich set of input cases that previous bounded model checking approaches would fail to prove.

As for future work, we are working on exploiting general prophecy automata (beyond safety) in order to achieve full generality for the ∀∃ case. A second direction is to handle asynchrony between the models in our algorithm. Even though model checking asynchronous variants of HyperLTL is in general undecidable [25,5,9], we would like to explore semi-algorithms and fragments with decidability properties. Lastly, exploring how to handle infinite-state systems in our framework by applying abstraction techniques is another promising future direction.

## References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## Reconciling Preemption Bounding with DPOR

Iason Marmanis(B), Michalis Kokologiannakis, and Viktor Vafeiadis

> MPI-SWS, Kaiserslautern and Saarbrücken, Germany {imarmanis,michalis,viktor}@mpi-sws.org

Abstract. There are two major techniques for scaling up stateless model checking: dynamic partial order reduction (DPOR), which only explores executions that differ in the ordering of racy accesses, and preemption bounding, which only explores executions containing up to k preemptions (preemptive context switches).

Combining these two techniques is challenging because DPOR-equivalent executions often contain a different number of preemptions, making it incorrect to cut explorations that exceed the preemption bound. To restore completeness, prior work has weakened the DPOR algorithm, which often results in the exploration of many redundant executions.

We propose an alternative approach. Starting from an optimal DPOR algorithm, we achieve completeness by allowing some slack on the preemption bound of the explored executions. We prove that the required slack does not exceed the number of threads of the program (minus two), and that this upper limit is tight.

## 1 Introduction

Stateless model checking (SMC) [12] is an effective bug-finding technique for concurrent programs that systematically explores all interleavings of the given input program. As such, it suffers from the state-space explosion problem: the number of possible interleavings of a program grows rapidly with the program size. There are two main approaches in the literature to attack this problem.


Combining the two approaches is non-trivial. Simply modifying a DPOR algorithm to discard any explored executions that exceed the desired bound k is not complete, as executions with ≤ k preemptions may be missed. To restore completeness, Coons et al. [10] weaken DPOR by adding extra backtracking points, but such an approach negates any optimality properties of the underlying DPOR algorithm and can lead to the (redundant) exploration of multiple equivalent interleavings.

In this paper, we propose a different approach. We adapt a state-of-the-art optimal DPOR algorithm with polynomial memory requirements, called TruSt [16], to support preemption-bounded search.

We first observe that the preemption-bound definition of Coons et al. [10] is overly pessimistic for incomplete executions (i.e., executions where at least one thread is enabled), in that an incomplete execution can often be extended to a complete one with a smaller preemption bound. Updating the definition to be more optimistic, however, does not fully resolve the issue: an intermediate execution that exceeds the bound might still be needed in order to reveal a conflicting instruction that leads to the exploration of the desired execution.

Our solution is to allow the exploration of executions exceeding the bound, as long as they only exceed it by a small amount, which we call slack. For programs with N ≥ 2 threads, we show that a slack value of N − 2 suffices to maintain completeness (up to the provided bound). Unlike Coons et al. [10], our approach is optimal in the sense that it does not explore equivalent executions more than once. Although it may explore executions with a larger bound than the desired one, we argue that these executions are useful, because they can still reveal bugs.

We have implemented our bounding approach in GenMC [18], a state-of-the-art open-source stateless model checker. We show that for small preemption bounds (and despite the slack), bounded search can perform significantly faster than full search. Moreover, we experimentally confirm the observation from the literature that small bounds suffice to expose most concurrency bugs. We therefore argue that our combination of preemption bounding and DPOR is useful as a practical testing approach, which also provides certain coverage guarantees.

## 2 Background

In this section, we recall the basic DPOR approach and how prior work has tried to incorporate preemption-bounded search into it. Subsequently, we review the TruSt algorithm [16], which we later build upon to obtain our results.

### 2.1 The Basics of Dynamic Partial Order Reduction

DPOR starts by exploring one thread interleaving. In the process, it detects conflicting transitions, i.e., instructions that, if executed in the opposite order, will alter the state of the system. At each state, when an earlier transition t is in conflict with a possible transition t′ that can be taken by another thread in this state, DPOR considers the execution where t′ is fired before t. To accomplish this, DPOR adds the transition t′ to the backtrack set of the state immediately before t was fired, to be explored later.

We illustrate DPOR by running it on the following example (Fig. 1).

$$\left.\begin{array}{l} \left(r\_x\right) \ a := x\\ \left(r\_y\right) \ b := y \end{array}\ \right\|\ \begin{array}{l} \left(w\_1\right) \ y := 1\\ \left(w\_2\right) \ y := 2 \end{array} \tag{\text{rr+ww}}$$

Fig. 1. Left-to-right DPOR exploration of rr+ww

After firing the transitions (r_x) and (r_y) (trace 1), DPOR adds transition (w_1) to the backtrack set of the state after the firing of transition (r_x), since transition (w_1) is in conflict with transition (r_y). When the initial exploration is finished (trace 2), DPOR backtracks to 1 and considers the second exploration option, i.e., firing transition (w_1) and thus reaching 3.

Subsequently, DPOR fires (r_y) (trace 4) and notices that it is in conflict with (w_2); it then adds (w_2) as an alternative exploration option for the state before the firing of (r_y) in 4. Again, DPOR finishes the exploration where the read instruction reads the value 1 (trace 5) and backtracks to 3. Now, (w_2) is fired (trace 6) and the algorithm continues with the remaining transitions, leading to 7. DPOR now terminates, since there is no other exploration option.

In this way, DPOR manages to explore all three equivalence classes (with representatives 2, 5, and 7) of the 6 interleavings of this program.
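The count of equivalence classes can be checked by brute force: the sketch below enumerates the six interleavings of rr+ww (assuming all variables start at 0, which the example presumably shares) and groups them by the final values of (a, b), recovering the three classes.

```python
from itertools import permutations

def interleavings():
    # Program order must be respected within each thread:
    # (rx) before (ry), and (w1) before (w2).
    return [seq for seq in permutations(['rx', 'ry', 'w1', 'w2'])
            if seq.index('rx') < seq.index('ry')
            and seq.index('w1') < seq.index('w2')]

def run(seq):
    # Execute one interleaving; all variables start at 0 (our assumption).
    x = y = a = b = 0
    for e in seq:
        if e == 'rx':
            a = x
        elif e == 'ry':
            b = y
        elif e == 'w1':
            y = 1
        else:
            y = 2
    return (a, b)

# The 6 interleavings collapse into 3 outcome classes: b reads 0, 1, or 2.
outcomes = {run(seq) for seq in interleavings()}
```

Here the final state (a, b) happens to characterize the equivalence classes, since the only race is between (r_y) and the two writes to y.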

### 2.2 Bounded Partial Order Reduction

Preemption bounding (PB) [25] prunes the state space by discarding executions that contain more preemptions than a given constant bound k. A preemption occurs at index i of a sequence of events τ whenever (1) events τ_i and τ_{i+1} originate from different threads, and (2) the thread of τ_i remains enabled after τ_i; in particular, τ_i is not the last event of its thread.
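The definition can be read off directly under the simplifying assumption that a thread stays enabled until its last event (true for straight-line threads such as those of rr+ww; in general, enabledness is semantic):

```python
def count_preemptions(tau):
    """tau: list of thread ids, one per executed event.
    A preemption at index i: tau[i] and tau[i+1] come from different
    threads, and tau[i]'s thread is not yet finished (simplified here
    as: it has another event later in tau)."""
    n = 0
    for i in range(len(tau) - 1):
        if tau[i] != tau[i + 1] and tau[i] in tau[i + 1:]:
            n += 1
    return n
```

For rr+ww, the left-to-right interleaving [1, 1, 2, 2] has no preemptions, while the alternating [1, 2, 1, 2] has two.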

Combining DPOR and PB is non-trivial. Specifically, simply pruning from DPOR's exploration space any trace with more than k preemptions is incorrect, because exploring such traces may be necessary to reach traces with at most k preemptions.

To see this, consider the run of rr+ww with k = 0. DPOR reaches the state where (rx) is fired and (w1) is considered as an alternative option in the backtrack set. Firing transition (w1) leads to trace 3, which exceeds the bound, since a transition from the second thread is present while the first thread is still enabled. By discarding this state, the execution where b = 2 (which is equivalent to 7) would never be considered, even though it respects the bound.

To address this issue, Coons et al. [10] conservatively add more backtrack points, accounting for such bound-induced dependencies. Concretely, when the two transitions of the first thread are fired (trace 1), Coons et al. [10] add (w1) to the backtrack set not only of the state before the firing of (ry) in 2, as in the unmodified DPOR algorithm, but also of the initial state. Additionally, the initial transition from a state is always picked from the same thread as the last fired transition, if possible. As a result, when the state with only (w1) fired is reached (due to the additional backtrack point), (w2) will be fired immediately afterwards, and eventually the interleaving that corresponds to the right-to-left execution of the threads will be explored.

While this solution guarantees that no execution within the bound is lost, it weakens DPOR, i.e., it leads to the exploration of equivalent interleavings that would otherwise not be considered. In rr+ww, for k > 0, Coons et al. [10] explore interleavings that only differ in the order of (rx) and (w1).

#### 2.3 TruSt: Optimal Dynamic Partial Order Reduction

The basic DPOR algorithm described in § 2.1 does not guarantee optimality, i.e., that only one execution from each equivalence class is explored. There are several improvements of the basic algorithm, some of which achieve optimality (e.g., [2, 17]). Here, we follow the most recent such improvement, TruSt [16], which achieves optimality with polynomial memory consumption.

TruSt represents program executions as execution graphs, a concept that appeared in previous works for DPOR under weak memory models [15, 17]. An execution graph G consists of a set of nodes G.E (a.k.a. events) representing the individual thread instructions executed, such as read events R and write events W, and three kinds of directed edges encoding the ordering between events: the program order po (the order of events within each thread), the reads-from relation rf (mapping each read to the write it reads from), and the coherence order co (a per-location total order on the writes to that location).
For an execution graph G, we define the following derived relations:

$$G.\mathtt{porf} \triangleq \left(G.\mathtt{po} \cup \left\{\langle G.\mathtt{rf}(r), r\rangle \,\middle|\, r \in G.\mathtt{R}\right\}\right)^{+} \qquad \text{(causality order)}$$

$$G.\mathtt{fr} \triangleq \left\{\langle r, w\rangle \,\middle|\, \langle G.\mathtt{rf}(r), w\rangle \in G.\mathtt{co}\right\} \qquad \text{(reads-before)}$$

The causality order, porf, relates two events if there is a path of program-order or reads-from edges between them, while fr orders a read event before every write that is coherence-after the write the read reads from.

An execution graph is SC-consistent (sequentially consistent) if there is a total ordering of its events respecting po such that each read event reads from the immediately preceding same-location write in the total order. Equivalently, a graph is SC-consistent if porf ∪ co ∪ fr is acyclic.
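The acyclicity characterization is straightforward to check. The sketch below (our own toy representation of execution graphs, not GenMC's) derives fr from rf and co and tests the union of the relations for cycles via DFS; since porf is the transitive closure of po and the rf edges, acyclicity of porf ∪ co ∪ fr coincides with acyclicity of po ∪ rf ∪ co ∪ fr.

```python
def acyclic(nodes, edges):
    # standard three-colour DFS cycle detection
    succ = {n: [] for n in nodes}
    for a, b in edges:
        succ[a].append(b)
    WHITE, GREY, BLACK = 0, 1, 2
    color = {n: WHITE for n in nodes}

    def dfs(n):
        color[n] = GREY
        for m in succ[n]:
            if color[m] == GREY:          # back edge: cycle found
                return False
            if color[m] == WHITE and not dfs(m):
                return False
        color[n] = BLACK
        return True

    return all(dfs(n) for n in nodes if color[n] == WHITE)

def sc_consistent(events, po, rf, co):
    """po, co: sets of ordered pairs; rf: dict read -> write it reads from."""
    # fr: a read is ordered before every write co-after the one it reads
    fr = {(r, w2) for r, w in rf.items() for (w1, w2) in co if w1 == w}
    rf_edges = {(w, r) for r, w in rf.items()}
    return acyclic(events, set(po) | rf_edges | set(co) | fr)
```

For a single thread w1: y := 1; w2: y := 2; r: a := y, making r read from w1 yields the fr edge ⟨r, w2⟩ and hence a cycle with po, so the graph is correctly rejected; reading from w2 is accepted.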

Execution graphs enable the efficient reversal of many conflicting events. If a write or a read event is in conflict with a previous write event, there is no need to backtrack to the state before the write event was added. Instead, the new event can be directly added to the execution and either read from a co-earlier write, in the case of a read event, or be placed co-before the conflicting write, in the case of a write event.

The only reversals where backtracking is necessary are those between a write event and a previously added read event: when a read event is added, it does not have the option to read from a write that has not yet been added. These reversals are referred to as backward revisits. To avoid exponential memory consumption, TruSt considers each exploration option eagerly when the new event is added, instead of maintaining backtrack sets for later exploration. In the case of backward revisits, TruSt removes the part of the execution that was added after the read event but is not in the prefix of the write event, where the prefix of an event is defined as the set of events that precede it in the porf order. This allows the write event to be directly added to the execution graph. Because many different execution graphs can lead to the same execution after a backward revisit, TruSt only considers the revisit if the events to be removed respect a maximality condition, defined in such a way that there is always exactly one such set of deleted events, achieving an optimal exploration.

## 3 Bounded Optimal DPOR: Obstacles

We discuss the two main obstacles that complicate the application of preemption-bounded search to a DPOR algorithm.

#### 3.1 Pessimistic Bound Definition

The first problem concerns the definition of preemptions for incomplete executions. Recall from the rr+ww example why the naive adaptation of DPOR with preemption bound k = 0 (incorrectly) fails to generate the execution reading b = 2. The partial trace 3 is discarded because it contains at least one preemption according to the definition of Musuvathi et al. [23]. (Both threads are enabled and have executed one instruction each.)

We argue that this trace should be deemed to have no preemptions, for the sake of monotonicity: trace 3 can be extended to a full trace (namely, 7) that (is equivalent to one that) does not have any preemptions.

We therefore modify the definition of preemptions as follows. A preemption occurs at index i of an event sequence τ whenever (1) events $\tau_i$ and $\tau_{i+1}$ originate from different threads and (2) the thread of $\tau_i$ remains enabled after $\tau_i$ and has further events in the trace $\tau_{i+1}\tau_{i+2}\ldots\tau_{|\tau|}$. According to our new definition, both interleavings that are equivalent with 3 have zero preemptions, because when switching to another thread, the first thread has no further events in the trace.

Fig. 2. A program and its intermediate execution that TruSt must explore in order to reach the right-to-left execution.

Our new definition satisfies monotonicity and coincides with the original on complete executions. We note, however, that partial executions with k preemptions cannot always be extended to a complete execution with k preemptions. Consider, for example, trace 4 of rr+ww, which has no preemptions. Firing the only remaining transition leads to trace 5, which has one preemption. A DPOR algorithm that employs our definition of preemptions might thus reach states that are bound-blocked: the currently explored execution respects the bound, but no final execution reachable from this state respects the bound. In our experience (see §6), bound-blocked executions do not seem to have a significant effect on the performance of our algorithm.
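The revised definition can be sketched as a count over a (possibly partial) trace of thread identifiers, assuming (our simplification) that threads never block, so having a further event in the trace implies the thread stayed enabled:

```python
def preemptions_partial(trace):
    """Count preemptions of a possibly incomplete trace under the
    revised definition: a context switch at i counts only if the
    switched-away thread has further events later in the trace."""
    count = 0
    for i in range(len(trace) - 1):
        t = trace[i]
        # (1) the next event is from another thread, and
        # (2) thread t shows up again later in this trace
        if trace[i + 1] != t and t in trace[i + 2:]:
            count += 1
    return count
```

For the partial trace where the first thread fires one event and then the second thread fires one (trace 3, as the thread sequence [1, 2]), the count is 0, as argued above; the graph-level measure preemptions(G) is then the minimum of this count over all equivalent traces.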

#### 3.2 Need For Slack

Monotonicity alone is not enough to incorporate bounded search in an algorithm like TruSt without forfeiting completeness: some executions that respect the bound might still be lost. Intuitively, since DPOR algorithms operate by detecting conflicting instructions during an interleaving's exploration and reversing the conflict to obtain a new interleaving, it might be the case that, for the conflict to be revealed, an execution that exceeds the bound needs to be explored.

We illustrate this point with the example in Fig. 2, where all the variables are initialized to zero. Consider a run of TruSt that always adds the next event from the left-most enabled thread. To reach the final execution that results from executing the threads from right to left, TruSt needs to pass through the execution depicted on the right of Fig. 2. In the next step, the second write of the third thread will be added, revealing a conflict with the first read of y of the second thread. The algorithm will then perform a backward revisit, removing the events of the second thread after the first read of y, and change the read's incoming rf edge to the new write event. The desired final execution will be reached after the remaining events of the second thread are added again.

It is easy to see that, while the final execution has zero preemptions, the depicted intermediate execution has at least one preemption, and would thus be discarded. This example can in fact be generalized by adding more threads identical to the third one; to reach the final right-to-left execution that has zero preemptions, TruSt must visit an execution that has at least N−2 preemptions, where N is the total number of threads. In §4, we show that this is in fact an upper limit: a final execution with k preemptions is always reachable through a sequence of executions that never exceed k + N − 2 preemptions. This result directly enables us to incorporate preemption-bounded search into TruSt by allowing some slack to the bound.

## 4 Recovering Completeness via Slack

Our bounded DPOR algorithm, Buster, is shown in Algorithm 1, where we have highlighted the differences w.r.t. TruSt [16].

We first discuss some additional notation used in the algorithm. First, each execution graph generated by the algorithm keeps track of the order $<_G$ in which events were added to it. Second, given a graph G and a set of events E, we write $G|_E$ for the restriction of G to E. Third, let G.cprefix(e) be the causal prefix of an event e in an execution graph G, i.e., the set of all events that causally precede it (including e itself); formally, $G.\mathtt{cprefix}(e) \triangleq \{e' \mid \langle e', e\rangle \in G.\mathtt{porf}^{*}\}$. Fourth, a subscript loc(a) restricts a set of events to those that access the same location as event a. Fifth, the function SetRF(G, a, w) adds an rf edge from w to a, and SetCO(G, wp, a) places a immediately after wp in co. Finally, we define the traces of an execution graph as the linearizations of $(G.\mathtt{porf} \cup G.\mathtt{co} \cup G.\mathtt{fr})$ on G.E, and we lift the definition of preemptions to an execution graph G: preemptions(G) is the minimum number of preemptions over the traces of G.

Apart from only exploring SC-consistent executions, Buster eagerly discards executions with more preemptions than the user-provided value k plus the slack (Line 5). If neither check causes a return, Buster continues by picking a new event to extend the current execution (Line 6). For correctness, we fix nextP(G) to always return the event that corresponds to the left-most available thread. Depending on the type of the new event, the algorithm proceeds in a different way. We discuss the interesting cases of read and write events.

If the new event a is a read event, Buster simply considers every possible write event as an rf option for a (Line 13), and eagerly explores the corresponding execution. If a is a write event, first every co placement is considered and explored (Line 15). Afterwards, Buster considers possible backward revisits: for every read event r that is not in the causal prefix of a, the execution where r reads from a is considered, after deleting the events added after r that are not in the causal prefix of a (Line 19). To avoid redundant revisits, the backward revisit is performed only when the set of deleted events satisfies a maximality condition (Line 18) (see [16] for more details).

Algorithm 1 A Bounded DPOR algorithm based on TruSt [16]

```
1: procedure Verify(P, k)
2: VisitP,k (G∅)
3: procedure VisitP,k (G)
4: if ¬consistent(G) then return
5: if preemptions(G) > k + N − 2 then return
6: switch a ← nextP(G) do
7: case a = ⊥
8: return "Visited full execution graph G"
9: case a ∈ error
10: exit("Visited erroneous execution graph G")
11: case a ∈ R
12: for w ∈ G.Wloc(a) do
13: VisitP,k (SetRF(G, a, w))
14: case a ∈ W
15: VisitCOsP,k (G, a)
16: for r ∈ G.Rloc(a) \ G.cprefix(a) do
17: Deleted ← {e ∈ G.E | r <G e} \ G.cprefix(a)
18: if ∀e ∈ Deleted ∪ {r}. IsMaximallyAdded(G, e, a) then
19: VisitCOsP,k (SetRF(G|G.E\Deleted , r, a), a)
20: case
21: VisitP,k (G)
22: procedure VisitCOsP,k (G, a)
23: for wp ∈ G.Wloc(a) do VisitP,k (SetCO(G, wp, a))
```
#### 4.1 Properties of TruSt

We now present some key properties of the TruSt algorithm, i.e., Algorithm 1 without Line 5, that are used to prove Buster's correctness (Theorem 1).

From TruSt's correctness argument, we know that every SC-consistent execution $G_f$ has exactly one sequence of $\mathtt{Visit}_P$ calls that leads to it. We call the sequence of the corresponding graphs a production sequence for $G_f$.

Given two SC-consistent graphs G and G′, we say that G is a prefix of G′, and write G ⊑ G′, if $G'|_{G.\mathtt{E}} = G$. Intuitively, G is a prefix of G′ if we can construct G′ from G by adding the missing events in some order, for some rf and co.

Let a maximal step of an execution G be an execution that results from extending a thread of G by an event e in a maximal way, i.e., if e ∈ R, then e is made to read from the co-latest write, and if e ∈ W, then e is placed at the end of co. We write G → G′ when G′ is a maximal step of G, and $G \xrightarrow{e} G'$ when G → G′ and e is the added event. We say that a sequence of maximal steps is non-decreasing when the sequence of the thread identifiers of the added events is non-decreasing. Finally, we write tid(e) for the thread identifier of an event e.

A key property of TruSt (stated in Prop. 1) is that every execution G in the production sequence of an SC-consistent execution $G_f$ is either a prefix of $G_f$, or it contains a read event r that does not read from the "correct" write, but there is a prefix $\hat{G}$ of $G_f$ that can be extended to G by a non-decreasing sequence of maximal steps starting with r and not including events of at least one thread to the right of r.

Proposition 1. Let S be the production sequence of an SC-consistent final execution $G_f$, and let G be an execution in S. Then, either $G \sqsubseteq G_f$, or there exist an execution $G_b$ that is before G in S, a read event $r = \mathtt{next}_P(G_b)$, a thread $t > \mathit{tid}(r)$, and an execution $\hat{G}$ such that $G_b \sqsubseteq \hat{G} \sqsubseteq G_f|_{G_b.\mathtt{E}\,\cup\,G_f.\mathtt{cprefix}(r)}$, $G_f|_{G_f.\mathtt{cprefix}(G_f.\mathtt{rf}(r))} \not\sqsubseteq G$, there is a non-decreasing sequence of maximal steps such that $\hat{G} \xrightarrow{r} \to^{*} G$, and $\forall e \in G.\mathtt{E} \setminus \hat{G}.\mathtt{E}.\ \mathit{tid}(e) \neq t$.

Intuitively, TruSt tries to construct $G_f$ by exploring an increasing sequence of its prefixes. This is not always possible, because when a read event r is added to $G_b$, the write event w that it should read from might not yet be present in $G_b$. In that case, r is made to read from another write and is later revisited by w, leading to the execution $G'_b = G_f|_{G_b.\mathtt{E}\,\cup\,G_f.\mathtt{cprefix}(r)}$, which is a prefix of $G_f$. It is possible that additional backward-revisit steps happen between $G_b$ and $G'_b$. Due to maximality, however, for every intermediate execution G in the production sequence between $G_b$ and $G'_b$, there is an execution $G_b \sqsubseteq \hat{G} \sqsubseteq G'_b$ that can be extended to G by a non-decreasing sequence of maximal steps. Execution $\hat{G}$ is exactly the part of G that is not deleted or revisited in a later step in S. Hence, if w is the first write that performed a backward revisit in S after G, then the events of thread t = tid(w) are already included in $\hat{G}$. Finally, it can be shown that t is to the right of r. The formal proof of this proposition can be found in the extended version of this paper [22].

#### 4.2 Correctness of Slacked Bounding

To see why executions in the production sequence of a graph $G_f$ can have at most preemptions($G_f$) + N − 2 preemptions, we start with a definition. A witness of a graph G is a trace of G that contains preemptions(G) preemptions.

Next, we observe that preemptions are monotone w.r.t. execution prefxes. That is, if an execution G requires a certain number of preemptions to be produced, a larger execution G′ ⊒ G requires at least that many preemptions.

Lemma 1. If G, G′ are SC-consistent and G ⊑ G′ , then preemptions(G) ≤ preemptions(G′ ).

To prove this, take a witness of G′ and restrict it to the events of G, obtaining a trace of G; the restriction can only remove preemptions, so this trace, and hence preemptions(G), has at most preemptions(G′) preemptions.

Further, we note that the number of preemptions of an execution is unaffected if we extend its last executed thread with a maximal step; if a maximal step adds an event to a different thread, the number increases by at most one.

Lemma 2. Let G and G′ be SC-consistent executions and r ∈ G′.E such that $G \xrightarrow{r} \to^{*} G'$. Then, preemptions(G′) ≤ preemptions(G) + S, where S is the number of threads that were extended to obtain G′ from G.

Proof. Consider a witness w of G and extend it by appending the missing events in the same order they were added in the sequence of maximal steps. Notice that, by construction of the maximal step, the resulting sequence is a trace of G′. Each time we add an event e to the trace such that the last event of the trace was not in the thread of e, we increase the preemption count by one: a thread was previously considered completed, but has now been extended with a new event. However, this can only happen S times: the maximal steps keep adding events of the same thread, and when another thread is picked, the first is not extended again (the maximal steps are non-decreasing). This gives us a trace of G′ with at most preemptions(G) + S preemptions, which concludes our proof.

We can now prove that Buster is complete, i.e., it visits every full, SC-consistent execution that respects the bound.

Theorem 1. Verify(P, k) visits every full, SC-consistent execution $G_f$ of P with preemptions($G_f$) ≤ k.

Proof. Consider a full, SC-consistent execution $G_f$ of P with at most k preemptions. From the completeness of TruSt, we know that a run of Algorithm 1 without the test on Line 5 will visit $G_f$. It thus suffices to show that every execution G in the production sequence of $G_f$ has at most k + N − 2 preemptions, where N is the number of threads of P. If $G \sqsubseteq G_f$, then from Lemma 1, preemptions(G) ≤ preemptions($G_f$) ≤ k.

Otherwise, from Prop. 1, there exist an execution $G_b$ that is before G in the production sequence of $G_f$ and an execution $\hat{G}$ such that $G_b \sqsubseteq \hat{G} \sqsubseteq G_f|_{G_b.\mathtt{E}\,\cup\,G_f.\mathtt{cprefix}(r)}$, $\mathtt{next}_P(G_b) = r \in \mathtt{R}$, $G_f|_{G_f.\mathtt{cprefix}(G_f.\mathtt{rf}(r))} \not\sqsubseteq G$, $\hat{G} \xrightarrow{r} \to^{*} G$, and no events in $G.\mathtt{E} \setminus \hat{G}.\mathtt{E}$ are in thread t, for some thread t to the right of r.

From the last two properties and Lemma 2, we have preemptions(G) ≤ k + N − 1, since preemptions($\hat{G}$) ≤ preemptions($G_f$) ≤ k ($\hat{G} \sqsubseteq G_f$ and Lemma 1) and at most N − 1 threads are extended from $\hat{G}$ to G.

To complete the proof, we show that preemptions(G) = k + N − 1 leads to a contradiction. The equality implies that $\hat{G}$ had k preemptions, that N − 1 threads were extended in the maximal steps from $\hat{G}$ to G, and that all of them increased the preemptions by one. The sequence of maximal steps from $\hat{G}$ to G is non-decreasing and starts with the thread of r. Since there are at most N threads, N − 1 are extended, and at least one thread to the right of r (namely t) is not extended, r is in the leftmost thread.

Let $t_r$ be the leftmost thread, $G'_b \triangleq G_f|_{G_b.\mathtt{E}\,\cup\,G_f.\mathtt{cprefix}(r)}$, and $w \triangleq G_f.\mathtt{rf}(r)$. From the proof of TruSt, we can infer that all events of $G_b$ are in the porf-prefix of the last event of $t_r$. Moreover, $G_f|_{G_f.\mathtt{cprefix}(w)} \not\sqsubseteq G_b$: the opposite, together with $G_b \sqsubseteq \hat{G} \sqsubseteq G$, would contradict $G_f|_{G_f.\mathtt{cprefix}(w)} \not\sqsubseteq G$. Since $G_b$ is in the production sequence of $G_f$, $G_b \sqsubseteq G_f$, $\mathtt{next}_P(G_b) = r$, and $G_f|_{G_f.\mathtt{cprefix}(w)} \not\sqsubseteq G_b$, TruSt will eventually add the write w and revisit the read r, reaching the execution $G'_b \sqsubseteq G_f$ that contains all events added before r, i.e., the events of $G_b$, the events in the porf-prefix of r, and r itself. Hence, all events in $G'_b.\mathtt{E} \setminus \{r\}$ are in the porf-prefix of r, which implies that any witness of $G'_b$ ends with r.

Since $G'_b \sqsubseteq G_f$, any witness t of $G'_b$ has at most k preemptions. Let G′ be the execution $G'_b$ without r, and G′′ the unique execution such that $\hat{G} \xrightarrow{r} G''$. Removing the last event r from t gives us a trace t′ of G′ with at most k preemptions. If t′ ends with an event of $t_r$, then we can restrict t′ to the events of $\hat{G}$ and add r at the end, obtaining a trace of G′′ with at most k preemptions. Otherwise, t′ does not end with an event of $t_r$, and thus t has one more preemption than t′, i.e., t′ has at most k − 1 preemptions. Then, we can again restrict t′ to the events of $\hat{G}$ and add r at the end, again obtaining a trace of G′′ with at most k preemptions. This contradicts our assumption that preemptions($\hat{G}$) = k and that all N − 1 threads extended from $\hat{G}$ increase the number of preemptions, since the first thread $t_r$ can be extended without incurring any more preemptions.

Buster inherits TruSt's optimality, as it only explores a subset of the executions that TruSt does. Here, optimality refers to avoiding redundant work; due to the slack, Verify(P, k) may also visit executions with more than k preemptions.

Theorem 2. Verify(P, k) explores each graph G of a program P at most once.

## 5 Implementation

We have implemented Buster on top of the GenMC tool [18], which implements the TruSt algorithm [16]. Since GenMC supports weak memory models and the standard notion of preemption bounding only makes sense for sequential consistency, we enforce SC in our benchmarks by using only SC memory accesses and selecting GenMC's RC11 model [20].

The bulk of our modifications to GenMC concern checking whether the preemption bound of an execution G exceeds a value k. In general, deciding whether the preemption bound of a Mazurkiewicz trace exceeds a given value is an NP-complete problem [23]. We use an adaptation of the bound computation of Musuvathi et al. [23] to execution graphs, but instead of recursively computing preemptions(G) (and caching computations across calls to amortize the cost), we recursively compute the predicate $\Phi(G, k) \triangleq \mathtt{preemptions}(G) \le k$. The benefit of this method is that we avoid calculating preemptions(G) exactly when its value exceeds the desired bound. Furthermore, there is no additional state that needs to be stored; Buster remains stateless.
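The short-circuiting idea can be sketched as a branch-and-bound search over the linearizations of the graph's ordering constraints: any branch whose running preemption count already exceeds k is pruned, so the exact minimum is never computed above the bound. The representation and names below are our own illustration, not GenMC's implementation:

```python
from collections import Counter

def within_bound(events, tid, dep, k):
    """Decide preemptions(G) <= k without computing the exact minimum.

    events: event ids; tid[e]: thread of e;
    dep: set of (a, b) pairs meaning a must precede b in every trace
    (program order plus the graph's rf/co/fr ordering constraints)."""
    preds = {e: {a for (a, b) in dep if b == e} for e in events}
    left = Counter(tid[e] for e in events)   # unplaced events per thread

    def search(placed, last, count):
        if count > k:
            return False                     # prune: bound exceeded
        if len(placed) == len(events):
            return True                      # found a full trace within k
        for e in events:
            if e in placed or not preds[e] <= placed:
                continue
            # switching away from `last` while it still has events left
            # counts as one preemption
            extra = int(last is not None and tid[e] != last
                        and left[last] > 0)
            left[tid[e]] -= 1
            ok = search(placed | {e}, tid[e], count + extra)
            left[tid[e]] += 1
            if ok:
                return True
        return False

    return search(frozenset(), None, 0)
```

For the rr+ww graph in which the read of y reads 1, every linearization has at least one preemption, so the check fails for k = 0 and succeeds for k = 1.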

As an optimization, we use as slack (Line 5) the minimum between N − 2 and the number of threads that have no deletable events; an event is not deletable if it is in the porf-prefix of a write that has performed a backward revisit. Intuitively, the events that are added to $\hat{G}$ to reach G (Prop. 1) are the events that will later be deleted to eventually reach a graph that is a prefix of the final graph $G_f$.


Table 1. Buggy benchmarks. An ✗ indicates that an error was found.

## 6 Evaluation

To evaluate Buster, we answer the following questions:


To that end, we evaluate Buster against GenMC on a diverse set of benchmarks. Unfortunately, we cannot include the approach of Coons et al. [10] in our comparison because their implementation is not available.

We can draw two major conclusions from our evaluation. First, most bugs manifest with a small number of preemptions (≤ 2), an observation that has been made in the literature before [25, 27]. Second, even though the bound calculation can be fairly expensive, for small bounds Buster outperforms GenMC and finds bugs faster.

Experimental Setup We conducted all experiments on a Dell PowerEdge M620 blade system with two Intel Xeon E5-2667 v2 CPUs (8 cores @ 3.3 GHz) and 256 GB of RAM. We used LLVM 11.0.1 for GenMC and Buster. All reported times are in seconds. We set a timeout limit of 30 minutes.

#### 6.1 Bound and Bug Manifestation

To validate that most bugs require a small number of preemptions, we run Buster and GenMC on three sets of benchmarks:


Table 2. Buggy CD benchmarks. An ✗ indicates that the error was found.


In all cases, we configure Buster to disregard any errors that occur in executions that exceed the bound and are explored due to the slack. We note that this configuration may delay bug finding, since Buster may by chance quickly come across a buggy execution with more than k preemptions (due to slack) before finding any buggy execution with up to k preemptions. Nevertheless, we follow it to ensure that the bugs found arise in executions with up to the desired number of preemptions, so as to be able to validate the claim that bugs manifest in executions with a small number of preemptions.

Table 1 reports our outcomes on the first two classes of benchmarks. As can be seen, Buster was able to find most bugs using a bound of 1. In fact, for most benchmarks, Buster found the bug before exploring a complete execution, hence the "0 ✗" entries in the table. The only benchmarks where Buster needs a bound greater than 1 are the synthetic benchmark triangular, which needs a bound of 8, as it was specifically designed to make bug discovery difficult and push model checkers to their limits, and reorder-20 and twostage-100, which have a large number of threads (20 and 100, respectively). Buster times out on the latter two benchmarks because the large number of threads puts a lot of stress on the bound-checking procedure. We note that for twostage-100, GenMC also fails to terminate within the time limit.

Table 2 reports our results for our CD benchmarks. For these benchmarks, we have taken CD implementations from the GenMC test suite, and induced bugs in them by randomly dropping a synchronization instruction or replacing a CAS instruction with a normal write or an unconditional exchange instruction, thereby introducing a possible atomicity violation. We then construct medium-sized clients (with 2-3 threads and up to 12 operations per thread) of these data structures that check their intended semantics (for example, that a queue has FIFO semantics). In all cases, the induced bugs lead to violations of the assertions in the client programs, and occasionally even to memory errors. Buster can find these bugs easily; a bound of k = 2 suffices to expose them. By contrast, GenMC times out for most of these benchmarks, as their state space is enormous.

Table 3. Buster and GenMC comparison on safe data structure benchmarks.

#### 6.2 Comparison with Plain DPOR on Safe Benchmarks

We have already seen that, modulo specially crafted synthetic benchmarks, a small preemption bound is sufficient for finding bugs in practice. Moreover, Buster is pretty good at finding such bugs in concurrent data structures. We now evaluate the application of Buster on a collection of safe benchmarks. For this purpose, we use different variations of the benchmarks of Table 2 (after repairing them so that no assertion is violated), as well as a few locking benchmarks.

Table 3 compares the performance of Buster for small values of k and GenMC. As can be seen, GenMC struggles with these benchmarks, whereas Buster with k = 2 (and often also with k = 3) terminates fairly quickly. This is because only a small fraction of the total executions of sizeable benchmarks have few preemptions. Restricting the search to only those executions therefore makes Buster run much faster than GenMC, while still guaranteeing that the program under consideration does not have any common bugs.

In the last column of Table 3, we include the maximum value of k for which Buster terminates faster than GenMC, for the benchmarks that terminate under GenMC. In most cases, Buster is faster than GenMC even for k > 3. For the dglm-fifo benchmarks, Buster is only faster for k ∈ {0, 1}, because for these benchmarks a small k suffices to fully explore the state space.

#### 6.3 Bound Calculation Overhead

We now measure the cost of checking that each encountered execution is below the specified bound. As we discussed in §5, checking whether an execution graph's preemption bound exceeds a value is an NP-complete problem, and we thus expect this calculation to threaten the performance of our tool.

To carefully account for this cost, we compare Buster against the baseline GenMC implementation on benchmarks where preemption bounding does not reduce the number of executions that are explored. In Table 4 (left), we report results on simple CD clients that have only one operation per thread of the Treiber stack [28] and the TTAS lock [13]. The clients are designed so that Buster can explore the full set of program executions with a small bound k. We suffix the name of the benchmarks with the number of writer and reader threads for the Treiber stack and the total number of threads for TTAS.

Column b contains the minimal value of the bound k for which Buster explores the same number of executions as GenMC does. Note that since these benchmarks contain several threads, exploration up to a certain bound (e.g., k = 0) does not mean that only executions with k preemptions are visited; due to slack, executions with more preemptions may be visited, and so it is possible for the exploration to cover the entire state space with a smaller bound than intrinsically necessary. In the subsequent columns, we report the time overhead (percentage) for bounds k = b, k = b + 1, and k = b + 2 w.r.t. GenMC's execution time, shown in the last column. The maximum overhead is observed for k = b (the minimal value sufficient to cover the entire state space). This is expected, because k = b places the most burden on the calculation of whether the number of preemptions in a given execution is below k. For larger k values, the overhead drops, because it is easier to show that the number of preemptions is below the bound; one does not have to calculate the number of preemptions of an execution precisely. Overall, for the Treiber stack benchmarks, the overhead introduced by calculating the bounds is fairly low and does not exceed 23% of the execution time of GenMC. For the plain runs of ttas-lock, the maximal overhead is a bit larger, up to 38%. We note, however, that such overhead only occurs in clients with a large number of threads (7); smaller clients are not affected as much.

#### 6.4 Overhead due to Bound-Blocked Executions

Finally, we measure the overhead caused by bound-blocked executions by evaluating how often they arise in practice. Specifically, we ran Buster on GenMC's test suite for various preemption-bound values, as well as on the safe CD clients used in § 6.2, and counted the number of bound-blocked executions.


Table 4. Overhead w.r.t. GenMC (left) and blocking in benchmarks (right).

For GenMC's test suite, the results are summarized in Table 4 (right). We have restricted our attention to the runs with at least 10 executions, so that our results are not skewed by benchmarks that have very few executions. We have also excluded 8 benchmarks from the test suite that use barriers, because these are currently not supported by our tool. As can be seen, bound-blocked executions are rare: most runs lead to one bound-blocked execution, and only 6 lead to more than 8 bound-blocked executions. Bound-blocked executions are on average no more than 6% of the total number of executions explored.

For the CD clients, bound-blocked executions are even rarer; out of the 22 clients, Buster encounters bound-blocked executions in only 4 of them, for some k. Again, we exclude from the discussion runs with very few executions. From the remaining runs, only two encounter a considerable number of bound-blocked executions, and that number becomes negligible as the bound is increased: around 10% for k = 1 and less than 1% for k = 2.

## 7 Related Work

There is a large body of work that has improved the original DPOR algorithm of Flanagan et al. [11]. Abdulla et al. [2] introduced the first optimal DPOR algorithm, which, however, suffers from possibly exponential memory consumption. Kokologiannakis et al. [16] developed TruSt, the first optimal DPOR algorithm that consumes polynomial memory.

Agarwal et al. [6], Chalupa et al. [8], Chatterjee et al. [9], and Huang [14] have extended DPOR for partitions coarser than the one we have focused on in this paper, i.e., Mazurkiewicz traces. Abdulla et al. [1, 4, 5] consider DPOR under various weak memory models, while the works of Kokologiannakis et al. [16, 17, 19] provide a DPOR algorithm that is parametric in the choice of the memory model, provided it respects some basic properties.

Qadeer et al. [25] showed the decidability of context-bounded verification of concurrent boolean programs. Musuvathi et al. [24] propose iterative context bounding, a search algorithm that prioritizes executions with fewer preemptions. Musuvathi et al. [23] combine partial-order reduction with a preemption-bounded search, and prove that deciding whether the preemption bound of a Mazurkiewicz trace exceeds a certain value is an NP-complete problem.

To our knowledge, the only attempt to combine DPOR and preemption bounding is by Coons et al. [10], who identify the difficulty of maintaining completeness of the exploration, and resolve it by weakening DPOR.

Abdulla et al. [3] and Atig et al. [7] have extended the notion of preemption bounding to weak memory models. We leave a possible extension of our approach to weak memory models for future work.

Acknowledgments. We thank the anonymous reviewers for their valuable feedback. This work has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 101003349).

## 8 Data-Availability Statement

All supplementary material is available at [22]. The artifact is also available at [21].

## References


Proc. ACM Program. Lang. 6, POPL (Jan. 2022). doi: 10.1145/3498711.


104 I. Marmanis et al.

[28] R. Kent Treiber. Systems Programming: Coping with Parallelism. Tech. rep. RJ5118, IBM, 1986. url: https://dominoweb.draco.res.ibm.com/58319a2ed2b1078985257003004617ef.html.

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## Optimal Stateless Model Checking for Causal Consistency

Parosh Abdulla<sup>1</sup>, Mohamed Faouzi Atig<sup>1</sup>, S. Krishna<sup>2</sup>, Ashutosh Gupta<sup>2</sup>, and Omkar Tuppe<sup>2</sup>

> <sup>1</sup> Uppsala University, Uppsala, Sweden {parosh,mohamed\_faouzi.atig}@it.uu.se
> <sup>2</sup> IIT Bombay, Mumbai, India {krishnas,akg,omkarvtuppe}@cse.iitb.ac.in

Abstract. We present a framework for efficient stateless model checking (SMC) of concurrent programs under three prominent models of causal consistency: CCv, CM, and CC. Our approach is based on exploring traces under the program order po and the reads-from relation rf. Our SMC algorithm is provably optimal in the sense that it explores each po and rf relation exactly once. We have implemented our framework in a tool called Conschecker. Experiments show that Conschecker performs well in detecting anomalies in classical distributed database benchmarks.

## 1 Introduction

Traditionally, distributed shared memories ensure that all processes in the system agree on a common order of all operations on memory. Such guarantees are provided by sequential consistency (SC) [33] and by linearizable memory [26]. However, providing these consistency guarantees entails access latencies, making them inefficient for large systems. There is a tradeoff between providing strong consistency guarantees and ensuring low latency, and this presents significant efficiency challenges. There is a large body of work which suggests that a systematic weakening of memory consistency can reduce the costs of providing consistency. Weakened consistency guarantees admit more concurrent behaviours than SC or linearizability. To this end, Lamport [32] proposed causal consistency, which provides an ordering among events in a distributed system in which processes communicate via message passing. This has been adapted [7] to a setting of reads and writes in a shared memory environment. In this setting, the return values of reads must be consistent with causally related reads and writes. As causality only orders events partially, the reading processes can disagree on the relative ordering of concurrent writes. This makes concurrent writer processes independent, reducing the costs of synchronization.

Several efforts have been made to formalize causal consistency [16], [25], [39], [40], [7], [15], [10], [8], [38], and there are many implementations [9], [20], [21] satisfying this criterion as opposed to strong consistency (linearizability).

S. Sankaranarayanan and N. Sharygina (Eds.): TACAS 2023, LNCS 13993, pp. 105–125, 2023. https://doi.org/10.1007/978-3-031-30823-9\_6

While strong consistency models are easier to program against than weak ones, they require costly implementations. Weak memories may be easier to implement, but are much harder to program. An acceptable middle ground that has emerged over the years comprises three important notions of causal consistency: causal consistency (CC) [15], [25], causal convergence (CCv) [16], [39], [15], [25], and causal memory (CM) [7], [39], [15], [25].

The focus of this paper is the verification of shared memory programs under causal consistency. We consider the three variants mentioned above. We propose a stateless model checking (SMC) framework that covers all three variants. SMC is a successful technique for finding concurrency bugs [23]. For a terminating program, SMC systematically explores all process schedulings that are possible during runs of the program. The number of possible schedulings grows exponentially with the execution length. To counter this and reduce the number of explored executions, the technique of partial order reduction [18,22] has been proposed. This has been adapted to SMC as DPOR (dynamic partial order reduction). DPOR was first developed for concurrent programs under SC [1,41]. Recent years have seen DPOR adapted to language-induced weak memory models [28,37], [5], as well as hardware-induced relaxed memory models [3,46]. To the best of our knowledge, DPOR algorithms have not been developed for causal consistency models. The goal of this paper is to fill this gap.

DPOR is based on the observation that two executions are equivalent if they induce the same ordering between conflicting events, and hence it is sufficient to consider one such execution from each equivalence class. Under sequential consistency, these equivalence classes are called Mazurkiewicz traces [34], while for relaxed memory models, their generalizations are called Shasha-Snir traces [42]. A Shasha-Snir trace characterizes an execution of a concurrent program by the relations (1) po (program order), which totally orders the events of each process, (2) rf (reads-from), which connects each read with the write it reads from, and (3) co (coherence order), which totally orders the writes to the same shared variable. DPOR can be optimized further by observing that the assertions to be verified at the end of an execution do not depend on the coherence order of shared variables, and hence it suffices to consider traces over po-rf. Based on this observation, the DPOR algorithms for programs under the release-acquire semantics (RA) and SC [5], [4] explore traces with po, rf and co, where the co edges are added on the fly. The equivalence classes are considered wrt po-rf, reducing the number of distinct traces to be analyzed.

Contributions. We propose a DPOR-based SMC algorithm for all three consistency models CC, CCv, CM, which systematically explores all the distinct po-rf traces covering all possible executions of the program. We develop a uniform algorithm for all three models which is sound and complete: that is, all traces explored are consistent wrt the model X ∈ {CC, CCv, CM} under consideration, and all such consistent traces are explored. Moreover, our algorithm is optimal in the sense that each consistent po-rf trace is explored exactly once. One of the key challenges during the trace exploration is to maintain the consistency of the traces wrt the model under consideration. We tackle this by defining a trace semantics which ensures that the traces generated in each step only contain edges which will be present in any consistent trace. We implement our algorithms in a tool Conschecker which is, to the best of our knowledge, the first of its kind to perform SMC on the three prominent causal consistency models CC, CCv, CM. Conschecker checks for assertion violations of programs under CC, CCv, CM. We evaluate the correctness of our tool on CC, CCv, CM by simulating these models on the memory model simulator Herd [8] and validating our outcomes against theirs. Then we proceed with an experimental evaluation on a wide range of benchmarks from distributed databases. We show that (i) Conschecker correctly detects known consistency bugs [13], [14], [12] and [11] under CCv, CM, CC, and (ii) Conschecker correctly detects known assertion violations in applications [19], [27], [12], [36]. We also stress-tested Conschecker on some SV-COMP benchmarks and parameterized benchmarks, which resulted in a large number (6 million) of traces.

Related Work. SMC has been implemented in many tools: CHESS [35], Concuerror [17], VeriSoft [24], Nidhugg [3], CDSChecker [37], RCMC [28], GenMC [30], rInspect [46] and Tracer [5]. While most of these work with either Mazurkiewicz traces or po-rf traces, [6] proposes an RVF-SMC algorithm where the value read is used to decide the equivalence of two runs.

In recent years, there has been much interest in DPOR algorithms: [4] for SC, [30] for the release-acquire semantics, [43] for C/C++, and [29] for TSO, PSO and RC11. It is known that CC is weaker than RA, CCv is stronger than RA, while CM is incomparable with RA [31]. Thus, all the above memory models differ from CC, CCv, CM, and we cannot reuse any of the existing DPOR algorithms.

Recent work on causal consistency [15] studies the complexity of checking whether one execution (or all executions) of a program under CC, CCv, CM is consistent. They show that checking whether a single execution is consistent is NP-complete, while checking whether all executions are consistent is undecidable. [11], [12] explore the robustness, wrt SC, of transactional programs under CC, CCv, CM. However, none of these papers propose a DPOR algorithm for CC, CCv, CM.

## 2 Preliminaries

Programs. We consider a program P consisting of a finite set T of threads (processes) that share a finite set X of (shared) variables, ranging over a domain V of values that includes a special value 0.

A process has a finite set of local registers that store values from V. Each process runs deterministic code, built in a standard way from expressions and atomic commands, using standard control flow constructs (sequential composition, selection, and bounded loop constructs). Throughout the paper, we use x, y for shared variables, a, b, c for registers, and e for expressions. Global statements are either writes x := e to a shared variable, or reads a := x from a shared variable. Local statements only access and affect the local state of the process, and include assignments a := e to registers, and conditional control flow constructs.

Note that expressions do not contain shared variables, implying that a statement accesses at most one shared variable.

The local state of a process proc ∈ T is defined by its program counter and the contents of its registers. A configuration of P is made up of the local states of all the processes. The values of the shared variables are not part of a configuration. A program execution is a sequence of transitions between configurations, starting with the initial configuration γ<sub>init</sub>. Each transition corresponds to one process performing a local or global statement. A transition between two configurations γ and γ′ is of the form γ −ℓ→ γ′, where the label ℓ describes the interaction with shared variables. The label ℓ takes one of three forms: (i) ⟨proc, ε⟩, indicating a local statement performed by thread proc, which updates only the local state of proc; (ii) ⟨proc, wt, x, v⟩, indicating a write of the value v to the variable x by the thread proc, which also updates the program counter of proc; and (iii) ⟨proc, rd, x, v⟩, indicating a read of v from x by the thread proc into some register, while also updating the program counter of proc. There is no constraint on the values that are used in transitions corresponding to read statements. This allows some illegal program behaviors, which is resolved by associating runs with so-called traces, which represent how reads obtain their values from writes. A causal consistency model X ∈ {CC, CCv, CM} is formulated by imposing restrictions on traces, thereby also restricting the possible runs that are associated with them.

Since local statements are not visible to other threads, we do not represent them explicitly in the transition relation considered in our DPOR algorithm. Instead, we let each transition represent the combined effect of some finite sequence of local statements by a process followed by a global statement by the same process. For configurations γ and γ′ and a label ℓ which is either of the form ⟨proc, wt, x, v⟩ or of the form ⟨proc, rd, x, v⟩, we let γ −ℓ→ γ′ denote that we can reach γ′ from γ by performing a sequence of transitions labeled with ⟨proc, ε⟩ followed by a transition labeled with ℓ. Defining the relation −→ in this manner ensures that we take the effect of local statements into account, while avoiding consideration of interleavings of local statements of different threads in the analysis.

We use γ −→ γ′ to denote that γ −ℓ→ γ′ for some ℓ, and define succ(γ) := {γ′ | γ −→ γ′}, i.e., the set of successors of γ wrt −→. A configuration γ is said to be terminal if succ(γ) = ∅, i.e., no thread can execute a global statement from γ. A run ρ from γ is a sequence γ<sub>0</sub> −ℓ<sub>1</sub>→ γ<sub>1</sub> −ℓ<sub>2</sub>→ · · · −ℓ<sub>n</sub>→ γ<sub>n</sub> such that γ<sub>0</sub> = γ. We say that ρ is terminated if γ<sub>n</sub> is terminal. We let Runs(γ) denote the set of runs from γ.
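The run vocabulary above can be sketched generically. In the sketch below (an illustration of the definitions, not part of the paper's tooling), configurations are opaque hashable values and succ is supplied as a function; a configuration is terminal exactly when succ returns no successors.

```python
# Enumerate all terminated runs from a configuration gamma, given a
# successor function succ over configurations (our own encoding of the
# definitions of succ, terminal configurations, and Runs above).

def runs(gamma, succ):
    """Yield every terminated run from gamma as a list of configurations."""
    nexts = succ(gamma)
    if not nexts:
        yield [gamma]  # gamma is terminal: succ(gamma) is empty
        return
    for g2 in nexts:
        for rest in runs(g2, succ):
            yield [gamma] + rest
```

For instance, over a small transition graph with two interleavings that reconverge, the generator yields one terminated run per path.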

Events. An event corresponds to a particular execution of a statement in a run of P. A write event ev is given by (id, proc, wt(x, v)), where id ∈ N is the identifier of the event, proc is the process containing the event, x ∈ X is a variable, and v ∈ V is a value. This event corresponds to the process writing the value v to variable x. Likewise, a read event ev is given by (id, proc, rd(x)), where x ∈ X. This event corresponds to the process reading some value from x. The read event ev does not specify the particular value it reads; this value will be defined in a trace by specifying a write event from which ev fetches its value. For each variable x ∈ X, we assume a special write event init<sub>x</sub> = wt(x, 0), called the initializer event for x. This event is not performed by any of the processes in T, and writes the value 0 to x. We define E<sub>init</sub> := {init<sub>x</sub> | x ∈ X} as the set of initializer events. If E is a set of events, we define subsets of E characterized by particular attributes of its events. For instance, for a variable x, we let E<sup>wt,x</sup> denote {ev ∈ E | ev.type = wt ∧ ev.var = x}.
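A minimal encoding of this event vocabulary might look as follows. The field names mirror the attribute notation used above (ev.type, ev.var); the concrete representation is our own choice, not the paper's.

```python
from dataclasses import dataclass
from typing import Optional

# Our own encoding of write events (id, proc, wt(x, v)) and read events
# (id, proc, rd(x)); reads carry no value, which is assigned by rf in a trace.

@dataclass(frozen=True)
class Event:
    id: int                   # position among the events of the same process
    proc: str                 # owning process; initializers use a dummy name
    type: str                 # "wt" or "rd"
    var: str                  # shared variable x
    val: Optional[int] = None # written value (writes only)

def init_event(x):
    """The initializer event init_x = wt(x, 0), performed by no process."""
    return Event(0, "_init", "wt", x, 0)

def subset(E, typ, var):
    """E^{typ,var} = { ev in E | ev.type = typ and ev.var = var }."""
    return {ev for ev in E if ev.type == typ and ev.var == var}
```

For example, subset(E, "wt", "x") recovers E<sup>wt,x</sup>, the write events on x (including init<sub>x</sub> if it is in E).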

Traces. A trace τ is a tuple ⟨E, po, rf⟩, where E is a set of events which includes the set E<sub>init</sub> of initializer events, and po (program order) and rf (read-from) are binary relations on E that satisfy:

• ev po ev′ if process(ev) = process(ev′) and ev.id < ev′.id; po totally orders the events of each individual process.

• ev rf ev′ if ev is a write event and ev′ is a read event on the same variable which obtains its value from ev.

We can view τ = ⟨E, po, rf⟩ as a graph whose nodes are E and whose edges are defined by the relations po and rf: po edges, depicted as solid red, capture the order within each process, while rf edges are depicted as solid blue. We define the empty trace τ<sub>∅</sub> := ⟨E<sub>init</sub>, ∅, ∅⟩, i.e., it contains only the initializer events, and all the relations are empty.

We now define when a trace can be associated with a run. Consider a run ρ of the form γ<sub>0</sub> −ℓ<sub>1</sub>→ · · · −ℓ<sub>n</sub>→ γ<sub>n</sub>, where ℓ<sub>i</sub> = ⟨proc<sub>i</sub>, t<sub>i</sub>, x<sub>i</sub>, v<sub>i</sub>⟩, and let τ = ⟨E, po, rf⟩ be a trace. We write ρ |= τ to denote that the following conditions are satisfied: (i) E = {ev<sub>1</sub>, . . . , ev<sub>n</sub>}, i.e., each event corresponds exactly to one label in ρ. (ii) If ℓ<sub>i</sub> = ⟨proc<sub>i</sub>, wt, x<sub>i</sub>, v<sub>i</sub>⟩, then ev<sub>i</sub> = ⟨id<sub>i</sub>, proc<sub>i</sub>, wt, x<sub>i</sub>, v<sub>i</sub>⟩, and if ℓ<sub>i</sub> = ⟨proc<sub>i</sub>, rd, x<sub>i</sub>, v<sub>i</sub>⟩, then ev<sub>i</sub> = ⟨id<sub>i</sub>, proc<sub>i</sub>, rd, x<sub>i</sub>⟩; an event and its label perform the same operation (write or read) on identical variables, and for writes, they also agree on the written value. (iii) id<sub>i</sub> = |{j | (1 ≤ j ≤ i) ∧ (proc<sub>j</sub> = proc<sub>i</sub>)}|; ev.id shows how it is ordered relative to the other events of process(ev). (iv) If ev<sub>i</sub> rf ev<sub>j</sub>, then x<sub>i</sub> = x<sub>j</sub> and v<sub>i</sub> = v<sub>j</sub>. (v) If init<sub>x</sub> rf ev<sub>i</sub>, then v<sub>i</sub> = 0, i.e., ev<sub>i</sub> reads the initial value of x, which is 0.
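Condition (iii) says, in particular, that an event's id is simply its position among the events of its own process in the run. A small sketch makes this concrete (labels encoded as tuples is our own assumption):

```python
# Our own illustration of condition (iii): assign to each label of a run
# the id counting its position among the labels of the same process.
# Labels are encoded as (proc, type, var, val) tuples.

def assign_ids(labels):
    """Return events (id, proc, type, var, val) with per-process ids."""
    seen = {}   # per-process counter of events so far
    events = []
    for (proc, typ, var, val) in labels:
        seen[proc] = seen.get(proc, 0) + 1
        events.append((seen[proc], proc, typ, var, val))
    return events
```

Interleaving two processes, each process's events are numbered 1, 2, ... independently of the global position in the run.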

## 3 Causally Consistent Models

We study three variants [15] of causal consistency : CC, CCv and CM. To define the three models formally, we introduce a function that, for each model, extends a given trace uniquely by a set of new edges. Then we define the model by requiring that the extended trace does not contain any cycles. A run of the program satisfies a consistency model X if its associated extended trace has no cycles.

Let CO, called the causality order, denote (po ∪ rf)<sup>+</sup>. Two events e<sub>1</sub>, e<sub>2</sub> are causally related if either e<sub>1</sub> CO e<sub>2</sub> or e<sub>2</sub> CO e<sub>1</sub>.

Causal Consistency CC. We start by presenting the weakest notion of causal consistency, CC [25], [7]. First we give an intuitive description of CC. In CC, events which are not causally related can be executed in different orders in different processes; moreover, decisions made about these orders can be revised by each process. To illustrate, consider the program in Fig. 1(b). The write events wt(x, 1), wt(x, 2) are not causally related and hence can be ordered in any way.

Fig. 1. Programs showing the differences between consistency models. The v denotes the expected return value of the read event.

Fig. 2. Solid red and blue edges are po and rf edges, respectively; wt(x, v) and rd(x) are write and read events.

Note that p<sub>b</sub> first orders x := 1 after x := 2 and reads 1 into a; it then revises this order, and orders x := 2 after x := 1 and reads 2 into b.

A trace τ does not violate CC as long as there is a causality order which explains the return value of each read event.

To capture traces violating CC, we define a relation OW (for overwrite) on writes to the same variable. For any two writes w<sub>1</sub>, w<sub>2</sub> and a read r on the same variable, if w<sub>1</sub> CO w<sub>2</sub> CO r and w<sub>1</sub> rf r, then w<sub>2</sub> OW w<sub>1</sub>. This says that r reads the overwritten write w<sub>1</sub>, resulting in a CO ∪ OW cycle. We refer to CO ∪ OW cycles as CCcycles. We define a function extendCC(τ) which extends a trace τ = ⟨E, po, rf⟩ by adding all possible OW edges between write events on the same variable. For a trace τ = ⟨E, po, rf⟩, we say that τ |= CC iff extendCC(τ) does not have a CCcycle.
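The construction of extendCC and the CCcycle check can be sketched directly from these definitions. In the sketch below (our own encoding, not the paper's implementation), events are arbitrary hashable values, relations are sets of (source, target) pairs, and var_of/type_of are dictionaries supplying the event attributes; the naive transitive closure is only meant for small examples.

```python
from itertools import product

def closure(edges):
    """Transitive closure of a relation, computed naively by saturation."""
    edges = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(edges), repeat=2):
            if b == c and (a, d) not in edges:
                edges.add((a, d))
                changed = True
    return edges

def extend_cc(po, rf, var_of, type_of):
    """Compute CO = (po ∪ rf)+ and all OW edges implied by the trace."""
    co = closure(po | rf)
    ow = set()
    # w2 OW w1 whenever w1 CO w2 CO r and w1 rf r, for writes w1, w2 and a
    # read r on the same variable (rf is assumed variable-consistent).
    for (w1, r) in rf:
        for (a, w2) in co:
            if a == w1 and (w2, r) in co and type_of[w2] == "wt" \
               and var_of[w2] == var_of[w1]:
                ow.add((w2, w1))
    return co, ow

def has_cc_cycle(co, ow):
    """A CCcycle is a cycle in CO ∪ OW."""
    full = closure(co | ow)
    return any((e, e) in full for (e, _) in full)
```

Running this on the trace of Fig. 1(a) described below (wt(x, 1) po wt(y, 1) rf rd(y) po wt(x, 2) po rd(x), with wt(x, 1) rf rd(x)) produces the OW edge wt(x, 2) OW wt(x, 1) and hence a CCcycle.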

Examples. The program in Fig. 1(a) is not CC since there is no causality order which explains the return values of the read events. If we consider any trace (Fig. 2) of the program in Fig. 1(a), we find that wt(y, 1) rf r<sub>1</sub>, where r<sub>1</sub> = rd(y), wt(x, 1) po wt(y, 1), and r<sub>1</sub> po wt(x, 2). Then we get wt(x, 1) CO wt(x, 2) and wt(x, 2) CO r<sub>2</sub>, where r<sub>2</sub> = rd(x) and wt(x, 1) rf r<sub>2</sub>, giving wt(x, 2) OW wt(x, 1) and witnessing a CCcycle.

Causal Convergence CCv. Under CCv, we need a total order on all write events per variable. This order, called the arbitration order, is an abstraction of how conflicts are resolved by all processes to agree upon one ordering among events which are not causally related. Thus, unlike CC, a process cannot revise its ordering of the events which are not causally related, and all processes must follow one ordering. This makes CCv stronger than CC.

To enforce a total order between all writes, we use a new relation CF, called the conflict relation, on all write events per variable. For all variables x ∈ X, writes w<sub>1</sub>, w<sub>2</sub> on x, and a read r = rd(x), if w<sub>1</sub> CO r and w<sub>2</sub> rf r, then w<sub>1</sub> CF w<sub>2</sub>. We define a function extendCCv(τ) which extends a trace τ = ⟨E, po, rf⟩ by adding all possible OW and CF edges between write events on the same variable. Traces violating CCv exhibit a CO ∪ CF ∪ OW cycle in extendCCv(τ), which we refer to as a CCvcycle. We say that τ |= CCv iff extendCCv(τ) does not contain a CCvcycle.
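The CF relation admits a similarly direct sketch (again our own encoding, not the paper's: CO is passed in as a precomputed set of pairs, and var_of/type_of are attribute maps of our own choosing):

```python
def cf_edges(co, rf, var_of, type_of):
    """w1 CF w2 whenever w1 CO r, w2 rf r, and w1, w2 are distinct writes
    on the variable that r reads (the conflict relation of CCv)."""
    cf = set()
    for (w2, r) in rf:
        for (w1, b) in co:
            if b == r and type_of[w1] == "wt" and w1 != w2 \
               and var_of[w1] == var_of[w2]:
                cf.add((w1, w2))
    return cf
```

On a trace where a process reads 1 and then 2 from x, the write wt(x, 1) is in the causal past of the read of 2, so a CF edge from wt(x, 1) to wt(x, 2) is added; a symmetric situation in another process would add the reverse edge and close a CCvcycle.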

Examples. For the program in Fig. 1(b) and any trace τ, extendCCv(τ) has a CCvcycle (see Fig. 2), since in any trace we have w<sub>1</sub> = wt(x, 1) CO r<sub>2</sub>, where r<sub>2</sub> = rd(x) and w<sub>2</sub> rf r<sub>2</sub> for w<sub>2</sub> = wt(x, 2), giving w<sub>1</sub> CF w<sub>2</sub>. We also have w<sub>2</sub> CO r<sub>1</sub>, where r<sub>1</sub> = rd(x) with w<sub>1</sub> rf r<sub>1</sub>, giving w<sub>2</sub> CF w<sub>1</sub>. Intuitively, we cannot find a total order amongst the writes to justify the reads of 1 and 2.

However, the program in Fig. 1(c) has a trace τ s.t. extendCCv(τ) does not have a CCvcycle. In the corresponding run, we first allow p<sub>a</sub> to complete execution, followed by p<sub>b</sub>.

Causal Memory CM. The CM model is stronger than CC and incomparable to CCv. Like CC, CM allows a process to diverge from another in its ordering of events which are not causally related. However, once a process chooses an ordering of such events, it cannot revise it; this makes CM stronger than CC and incomparable to CCv.

A happened-before relation per process fixes the per-process ordering of events. For a read/write event e in a trace, the causal past of e, CausalPast(e) = {e′ | e′ CO e}, is the set of events which are in the causal past of e. For an event e, the happened-before relation HB<sub>e</sub> [15] is the smallest transitive relation on events such that for all events e<sub>1</sub>, e<sub>2</sub> ∈ CausalPast(e), e<sub>1</sub> CO e<sub>2</sub> ⇒ e<sub>1</sub> HB<sub>e</sub> e<sub>2</sub>. In other words, CO|<sub>CausalPast(e)</sub> ⊆ HB<sub>e</sub>: HB<sub>e</sub> contains all pairs of events obtained by restricting CO to the events in the causal past of e. For any variable x, if we have writes w<sub>1</sub>, w<sub>2</sub> on x and a read r<sub>2</sub> = rd(x) such that (i) r<sub>2</sub> = e or r<sub>2</sub> po e, w<sub>2</sub> rf r<sub>2</sub>, and w<sub>1</sub> HB<sub>e</sub> r<sub>2</sub>, then w<sub>1</sub> HB<sub>e</sub> w<sub>2</sub>; and (ii) if w<sub>1</sub> HB<sub>e</sub> w<sub>2</sub> and w<sub>1</sub> rf r<sub>1</sub>, then r<sub>1</sub> HB<sub>e</sub> w<sub>2</sub>.

Let e<sub>p</sub> be the po-last event of process p: that is, for all events e in process p, e = e<sub>p</sub> or e po e<sub>p</sub>. Since HB<sub>e</sub> ⊆ HB<sub>e<sub>p</sub></sub> for all events e in process p, HB<sub>e<sub>p</sub></sub> fixes the ordering among all causally unrelated events for process p. We write HB<sub>p</sub> instead of HB<sub>e<sub>p</sub></sub>.

We define a function extendCM which extends a trace τ = ⟨E, po, rf⟩ by adding all possible OW and HB<sub>p</sub> edges for all processes p. Traces violating CM exhibit an OW ∪ HB<sub>p</sub> cycle, called a CMcycle, in extendCM(τ) for some process p. We say that τ |= CM iff extendCM(τ) does not contain a CMcycle. See Figure 3, which motivates conditions (i) and (ii) for adding HB edges so that extendCM(τ) does not contain a CMcycle.

Examples. For the program in Fig. 1(c) and any trace τ, extendCM(τ) contains a CMcycle. Consider the read event o<sub>p<sub>b</sub></sub> = rd(x) with wt(x, 2) rf o<sub>p<sub>b</sub></sub>. Then wt(x, 1) po wt(y, 1) rf rd(y) po o<sub>p<sub>b</sub></sub>, that is, wt(x, 1) CO o<sub>p<sub>b</sub></sub>. This induces wt(x, 1) HB<sub>p<sub>b</sub></sub> o<sub>p<sub>b</sub></sub>, and wt(x, 1) HB<sub>p<sub>b</sub></sub> wt(x, 2). This results in wt(z, 1) po wt(x, 1) HB<sub>p<sub>b</sub></sub> wt(x, 2) po r, where r = rd(z) with wt(z, 0) rf r. This gives wt(z, 1) HB<sub>p<sub>b</sub></sub> wt(z, 0), resulting in a cycle. However, the program in Fig. 1(d) has a trace τ s.t. extendCM(τ) does not contain a CMcycle.

Fig. 3. Start with (a). In (b) we add the HB edge from rd(z) to wt(z, 2) following condition (ii). Then (c) is obtained on adding rd(x), with wt(x, 1) rf rd(x). In contrast, (b)′ does not follow condition (ii). Hence, when rd(x) is added in (c)′, wt(x, 2) is available to be read. Choosing wt(x, 2) rf rd(x) necessitates adding wt(x, 1) HB wt(x, 2) in (d)′ by condition (i). This necessitates adding wt(z, 2) HB wt(z, 1) in (e)′, creating a CMcycle.

A run ρ satisfies a model X ∈ {CC, CCv, CM} if there exists a trace τ such that ρ |= τ and τ |= X. Define γ<sub>X</sub> := {τ<sub>X</sub> | ∃ρ ∈ Runs(γ). ρ |= τ<sub>X</sub> ∧ τ<sub>X</sub> |= X}, the set of traces generated under X from a given configuration γ.

Note. Similar to our characterization of bad traces using cycles, [15] uses bad patterns in differentiated histories to capture violations of CC, CCv, CM. Differentiated histories are posets labeled with wt(x, v) and rd(x) v such that no two events wt(x, v<sub>1</sub>) and wt(x, v<sub>2</sub>) have v<sub>1</sub> = v<sub>2</sub>. Bad patterns are characterized in [15] using the po and reads-from relations on differentiated histories. Since we work with traces having po and rf, we do not require differentiated writes.

## 4 Trace Semantics

To analyse a program P under a model X ∈ {CC, CCv, CM}, all runs of P must be explored. We do this by exploring the associated traces. In fact, two runs having the same associated traces are equivalent, since the assertions to be checked at the end of a run depend only on po and rf. We begin with the empty trace, and continue exploration by adding enabled read/write events to the traces generated so far. While doing this, we must ensure that the generated traces τ are s.t. τ |= X. We present two efficient operations to add a new read/write event to a trace τ, obtaining a trace τ′ such that extendX(τ′) does not contain an Xcycle. We discuss two notions that are relevant while adding a new read event to a trace.

Readability and Visibility. For all three models, readability identifies the write events w from which a newly added read r can fetch its value. Visibility is used to add, in the case of CCv, new CF edges (and in the case of CM, new HB edges) that are implied by the fact that the new read event reads from w. Let τ = ⟨E, po, rf⟩ be a trace, and τ<sub>X</sub> = extendX(τ). Let τ<sub>X</sub><sup>r</sup> denote the result of adding r to τ<sub>X</sub>. We define the readable set readable(τ<sub>X</sub><sup>r</sup>, r, x) for a read event r from process p on variable x.


Intuitively, readable(τ<sub>X</sub><sup>r</sup>, r, x) contains all write events which are not hidden in τ<sub>X</sub><sup>r</sup> by other writes on x. The newly added read event r can fetch its value from a write in readable(τ<sub>X</sub><sup>r</sup>, r, x). The visible set visible(τ<sub>X</sub><sup>r</sup>, r, x) is defined as the set of events in readable(τ<sub>X</sub><sup>r</sup>, r, x) which can "reach" r in τ<sub>X</sub><sup>r</sup>. Let τ<sup>rw</sup> denote the trace obtained by adding r and w rf r to trace τ.
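The formal definitions of readable and visible are not reproduced above; the sketch below implements only the stated intuition, under our simplifying assumptions that "hidden" means overwritten along the extended causal order and "reach" means related to r by that order (relations encoded as sets of pairs, as before):

```python
# Sketch of the intuition only (not the paper's definitions): a write w on x
# is readable for a new read r unless some other write w2 on x satisfies
# w CO w2 CO r (w is hidden); a readable write is visible if it reaches r.

def readable(writes_x, co, r):
    """Writes on x not hidden from r by another write on x."""
    return {w for w in writes_x
            if not any((w, w2) in co and (w2, r) in co
                       for w2 in writes_x if w2 != w)}

def visible(writes_x, co, r):
    """Readable writes that can reach r through the (extended) order."""
    return {w for w in readable(writes_x, co, r) if (w, r) in co}
```

A write that is causally unordered with r can thus be readable without being visible, which is exactly the case where no new CF/HB edges are forced.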


The trace semantics for a model X ∈ {CC, CCv, CM} is given as the transition relation −→<sub>X-tr</sub>, defined as τ<sub>X</sub> −α→<sub>X-tr</sub> τ′<sub>X</sub>, where extendX(τ) = τ<sub>X</sub> and extendX(τ′) = τ′<sub>X</sub>. The label α is one of (read, r, w) or (write, w), representing, respectively, a read r reading from a write w, and a write event w. An important property of τ<sub>X</sub> −α→<sub>X-tr</sub> τ′<sub>X</sub> is that if τ<sub>X</sub> does not have an Xcycle, then τ′<sub>X</sub> does not have one either; in other words, if τ |= X, then τ′ |= X. We now describe the transitions τ<sub>X</sub> −α→<sub>X-tr</sub> τ′<sub>X</sub>, where extendX(τ) = τ<sub>X</sub>, extendX(τ′) = τ′<sub>X</sub>, τ = ⟨E, po, rf⟩, and τ′ = ⟨E′, po′, rf′⟩. We start from the empty trace τ<sub>∅</sub>, with extendX(τ<sub>∅</sub>) = τ<sub>∅</sub>.

	- When X = CCv: add new CF edges from all w″ ∈ visible(τ<sub>X</sub><sup>r</sup>, r, x) to w to obtain τ′<sub>X</sub>.
	- When X = CM: add new HB<sub>p</sub> edges from all w″ ∈ visible(τ<sub>X</sub><sup>r</sup>, r, x) to w. Adding these HB<sub>p</sub> edges can result in w<sub>1</sub> HB<sub>p</sub> w<sub>2</sub> for write events w<sub>1</sub>, w<sub>2</sub> on a variable y. If we had w<sub>1</sub> rf r<sub>1</sub> and r<sub>1</sub> po r, then add r<sub>1</sub> HB<sub>p</sub> w<sub>2</sub>. When we are done adding all such HB<sub>p</sub> edges, we obtain τ′<sub>X</sub> (Figure 4(iv)).

Lemma 1. If τ<sub>X</sub> = extendX(τ) with τ |= X, and τ<sub>X</sub> −α→<sub>tr</sub> τ′<sub>X</sub> = extendX(τ′), then τ′ |= X for X ∈ {CC, CCv, CM}.

Efficiency and Correctness. Each step of −α→<sub>tr</sub> is computable in polynomial time. This is based on the fact that the readable and visible sets are computable in polynomial time. The correctness of the trace semantics for a model X stems from the fact that it generates only those X-extensions which do not have cycles (Lemma 1). The transitions ensure acyclicity of the resultant extended traces.

Fig. 4. w<sub>1</sub>, w<sub>6</sub> are writes on y, r<sub>1</sub> is a read on y, and w<sub>2</sub>, w<sub>3</sub>, w<sub>4</sub>, w<sub>5</sub> are writes on x in τ<sub>CM</sub>. Add a read r on x to τ<sub>CM</sub>. w<sub>2</sub>, w<sub>3</sub>, w<sub>5</sub> ∈ readable(τ<sub>CM</sub><sup>r</sup>, r, x). Choose w<sub>5</sub> rf r. Then we add w<sub>2</sub> HB w<sub>5</sub> and w<sub>3</sub> HB w<sub>5</sub>. The addition of w<sub>2</sub> HB w<sub>5</sub> results in w<sub>1</sub> HB w<sub>6</sub>. Since w<sub>1</sub> rf r<sub>1</sub>, add the HB edge from r<sub>1</sub> to w<sub>6</sub> to obtain τ′<sub>CM</sub>.


## 5 DPOR Algorithm for CC, CCv, CM

We present our DPOR algorithm, which systematically explores, for any terminating program under a consistency model X ∈ {CC, CCv, CM}, all traces τ<sub>X</sub> wrt X which can be generated by the trace semantics. Enabled write events from any of the processes are added to the trace generated so far, and we proceed with the next event. For a read event r, we add r to the trace and explore, in separate branches, all possible write events w from which r can read. Each such branch is a sequence of events, also called a schedule. There may be writes w which are added to the trace later in the exploration and from which r can also read. Such writes w are called postponed wrt r; when w is added to the trace later, the algorithm will have a branch where r can read from w. In that branch, the algorithm reorders events in the sequence s.t. w and r exchange places, and all events which are needed for w to occur are also placed before w (CreateSchedule). All generated schedules will be executed

**Algorithm 2:** CreateSchedule
Input: X ∈ {CC, CCv, CM} is a consistency model, τX is an X-extension, and π is an explored observation sequence.

     1  let w be last(π) and x be var(w)
     2  for i ← |π| − 1 to 1 do                        // look for reads r that have postponed w
     3      let r be the element at π[i]
     4      if r is a read on x ∧ ¬(r CO w) ∧ Swappable(r) then
     5          β ← ⟨⟩; flag ← true
     6          for j ← i + 1 to |π| − 1 do            // get all events after r in π that precede w in CO
     7              let ev be the element at π[j]
     8              if ev CO w then
     9                  if r CO ev then
    10                      flag ← false; break
    11                  else β ← β • π[j]
    12          β ← β • w • ⟨r, w⟩                     // the schedule ends with w and (r, w)
    13          if flag ∧ ∄ β′ ∈ Schedules(r) . β′ ≈ β then
    14              Schedules(r) ← Schedules(r) ∪ {β}  // r can read from w


**Algorithm 3:** RunSchedule
Input: X ∈ {CC, CCv, CM} is a consistency model, τX is an X-extension, π is an explored observation sequence, and β is a schedule.

     1  if β ≠ ⟨⟩ then                                 // explore the sequence of observations one by one
     2      let β be α • β′; choose τ′X : τX −α→X−tr τ′X   // follow write and read
     3      if α = (read, r, w) then Swappable(r) ← false
     4      RunSchedule(X, τ′X, π • α, β′)
     5  else ExploreTraces(X, τX, π)

by RunSchedule. The algorithm is uniform across the models; the main technical differences are taken care of by the respective trace semantics, which guides the exploration of traces in each model.

The ExploreTraces Algorithm. This algorithm (Algorithm 1) takes as input a consistency model X ∈ {CC, CCv, CM}, an X-extension τX, and an observation sequence π. π is a sequence of events of the form (write, w) or (read, r, w). The initial invocation is with the empty trace τ0 and the empty observation sequence π = ⟨⟩. The observation sequence is used to swap read operations with write operations that are postponed wrt them. From the initial τ0, we choose an operation from any of the processes.

If a write operation is enabled, one such write is chosen nondeterministically from any process, added to the trace according to the trace semantics, and appended to the observation sequence, whereafter ExploreTraces is called recursively to continue the exploration (line 3). After the recursive calls have returned, the algorithm calls CreateSchedule, which finds read operations r in the observation sequence that could have read from a write operation w had w been performed before r. For each such read r, CreateSchedule creates a schedule for r: an observation sequence that can be explored from the point when r was performed, allowing w to occur before r so that r can read from w. When a read operation r is enabled, the set Schedules(r) is initialized (line 6). This set is updated by CreateSchedule when subsequent writes are explored. We also keep a Boolean flag Swappable(r) for each read event r. It is initialized to true, indicating that r is swappable, that is, subsequent writes can be considered for r. The flag is set to false for read events appearing in a schedule so that they are not swapped again, eliminating redundant explorations. For each generated write event w from which r can read, ExploreTraces is called recursively (line 7) to continue the exploration. Once these recursive calls have returned, the set of schedules collected in Schedules(r) for the read r is considered. RunSchedule explores all these schedules, in each of which the read fetches its value from the respective write.
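The core branching idea above — for each read, explore one branch per write it can observe — can be sketched as a toy, with all names our own; this is not the paper's algorithm, and the consistency axioms of CC, CCv, and CM, which prune some reads-from assignments, are deliberately omitted.

```python
from itertools import product

# Toy sketch: for a straight-line program, branch on the write that each
# read observes. Each assignment of writes to reads corresponds to one
# po-rf equivalence class; consistency checks are omitted in this toy.
def rf_branches(writes, reads):
    # writes: list of (name, variable); reads: list of variables read
    per_read = [[w for w in writes if w[1] == x] for x in reads]
    for combo in product(*per_read):
        yield list(zip(reads, combo))

writes = [("w1", "x"), ("w2", "x"), ("w3", "y")]
reads = ["x", "x", "y"]
branches = list(rf_branches(writes, reads))
# two x-writes for each of the two x-reads, one y-write for the y-read
```

In the real algorithm, branches for postponed writes are not enumerated upfront but created lazily as schedules once those writes appear in the exploration.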

The CreateSchedule Algorithm. The input to this algorithm is a consistency model X, a trace τX wrt X, and an observation sequence π whose last element is a write w. The algorithm looks for reads in π for which w is a postponed write. Such a read r must be on the same variable as w, must be swappable, and must not precede w wrt CO (line 4). We begin with the closest such read r (from the write w), at position π[i]. After finding r, a schedule β is created. The schedule consists of all elements following r in π and preceding w wrt CO (line 12). It ends with w • (r, w), allowing r to read from w (line 13). This schedule is added to Schedules(r) only if Schedules(r) does not already contain a schedule β′ with the same set of observations, i.e., a β′ ≈ β.

The RunSchedule Algorithm. The inputs are a consistency model X, a trace τX, an observation sequence π, and a schedule β. The observations in β are explored one by one, by recursively calling RunSchedule and updating the trace. The read events in the schedule are marked not swappable, preventing a redundant exploration for them (schedules where these reads are swapped with the respective writes will already have been created by CreateSchedule). All proofs and an illustrative example can be found in the extended version of the paper [2].
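The bookkeeping that RunSchedule performs can be sketched as follows, with the trace semantics elided and all names our own: observations are consumed one by one, the trace is extended, and a read executed from a schedule is marked not swappable so it is not reconsidered for later writes.

```python
# Minimal sketch of RunSchedule's bookkeeping (trace semantics elided).
def run_schedule(trace, schedule, swappable):
    for obs in schedule:
        trace = trace + [obs]              # extend the trace with the observation
        if obs[0] == "read":
            swappable[obs[1]] = False      # reads executed from a schedule are fixed
    return trace

swappable = {"r1": True}
trace = run_schedule([("write", "w1")],
                     [("write", "w2"), ("read", "r1", "w2")],
                     swappable)
```

After the schedule is exhausted, the real algorithm resumes the ordinary exploration by calling ExploreTraces on the resulting trace.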

**Theorem 1.** *Our DPOR algorithms are sound, complete and optimal.*

Soundness, Optimality and Completeness. The algorithm is sound in the sense that, if we initiate Algorithm 1 from (X, τ0, ⟨⟩), then all explored traces τ are such that τ |= X. This follows from the fact that the exploration uses the −→X−tr relation. The algorithm is optimal in the sense that, for any two different recursive calls to Algorithm 1 with arguments (X, τ¹X, π1) and (X, τ²X, π2), if τ¹X and τ²X are extendible, then τ¹X ≠ τ²X. This follows from the facts that (i) for a given read r, each iteration of the for loop in line 7 corresponds to a different write, (ii) in each schedule β ∈ Schedules(r) in line 8 of Algorithm 1, the read event r reads from a write w which is different from all writes it reads from in line 7, and (iii) any two schedules added to Schedules(r) at line 14 of Algorithm 2 are different. The algorithm explores traces of all terminating runs, and is hence complete.

## 6 Experimental Evaluation

We describe the implementation of our optimal DPOR algorithm for the causal consistency models CC, CCv, and CM as a tool, Conschecker, available at [45]. To the best of our knowledge, Conschecker is the first stateless model checking tool for the causal consistency models CC, CCv, CM.

Conschecker. Conschecker extends Nidhugg [3] and works at the LLVM IR level, accepting a C program as input. At runtime, Conschecker controls the exploration of the input program until it has explored all traces using the DPOR algorithm. It can detect user-provided assertion violations by analyzing the generated traces. We conducted all experiments on an Ubuntu 22.04.1 LTS machine with an Intel Core i7-1165G7 and 16 GB RAM.

Experimental Setup. We consider the following categories of benchmarks.

• A set of thousands of litmus tests (Section 6.1) generated from [8]. The main purpose of these experiments is to provide a sanity check of the correctness of Conschecker on all three consistency models.

• A collection (Section 6.2) of concurrent benchmarks taken from the TACAS competition on software verification (SV-COMP) [44]. These are small programs with 50-100 lines of code, used by many tools [4], [5].

• Five applications (Section 6.3): Voter [19], Twitter clone [27], Fusion ticket [27], and two versions of Auction [36], extracted from the literature on databases, which we check for assertion violations wrt the three consistency models.

• Classical database benchmarks (Section 6.4) reported in recent papers on consistency models [13], [12] and [14]. We classify these benchmarks as SAFE or UNSAFE on all three models, depending on whether they witness an assertion violation.

• Eight parameterized programs (Section 6.5) from [5] and [4], used to study the scalability of Conschecker when increasing the number of processes as well as the number of read and write instructions in programs.

#### 6.1 Litmus Benchmarks

We apply Conschecker on a set of 9815 litmus benchmarks generated from [8]. Litmus tests are standard benchmark programs used by many tools targeting weak memory models. In these litmus tests, the processes execute concurrently, and we validate assertions under the respective memory model, providing a sanity check of the correctness of Conschecker. We compared the outcomes observed by Conschecker on the litmus tests with expected outcomes generated from [8], which we obtained by simulating the CC, CCv, and CM semantics of [8] on these litmus tests. Out of the 9815 litmus tests, we found no assertion violations in 3810 tests under CC and CM, and in 3811 tests under CCv. The results obtained from Conschecker matched the expected outcomes. Conschecker took less than 3 minutes to execute all litmus tests across the models.


Table 2. Applications.


Table 3. SV-Comp Benchmarks


#### 6.2 SV-COMP Benchmarks

These benchmarks [44] consist of five programs written in C/C++, each having 2 processes with 50-100 lines of code per process (Table 3). The main challenge in these benchmarks is the large number of traces to be explored. The benchmarks have assertion checks, and under CC, CCv, and CM all these assertions are violated. Conschecker stops the exploration as soon as it detects the first assertion violation. To check the efficiency of Conschecker, we removed all assertions and let Conschecker exhaustively explore all po-rf traces. Since these benchmarks have a large number of traces, they serve as a stress test.

#### 6.3 Database Applications

Table 2 reports the performance of Conschecker on a set of programs inspired by five applications extracted from the literature on distributed systems [19], [27], [12], [36]. The applications we considered are:

• Voter [19]: This application is derived from a software system used to record votes in a talent show. Users can vote for any of the n contestants from any one of the m sites (processes). The application asserts that users cannot vote from multiple sites and cannot vote for multiple contestants, and checks for violations of this. [19] considers 3 sites and 3 users, and we follow suit.


Table 4. Parameterized Benchmarks from [5] and [4]

• Twitter clone [27]: This is based on a Twitter-like service where each user has some followers. The following assertion is checked: when a user tweets, the tweet ID must be added to each follower's timeline exactly once, provided the user did not remove the tweet. We considered 3 users using 3 processes; each process has 10 tweet IDs and 6 followers.

• Fusion ticket [27]: There is a building with multiple concert rooms (venues). Tickets for venue i are sold by salesperson i, who updates the day's sales in the backend database. The per-venue ticket sales must be updated correctly in the database, so that the concert manager sees the correct total number of tickets sold; a discrepancy in this number is a violation. Each venue is represented by a process, and the communication across processes ensures that the total sum is correct. We considered 4 venues, each with 10 tickets.

• Auction [36] and Auction-2 [36]: There are n bidders and an auctioneer participating in an auction, modeled using n+1 processes. The assertion to be checked is that the highest bidder is declared the winner. Auction is the buggy version of this application, while Auction-2 is the correct one.

• Group is a synthetic application created by us, inspired by WhatsApp groups. There is a group with n members, and a new person wants to be added to the group. This person must be added to the group by exactly one of the existing members; that is, a violation consists of adding the person more than once (by one or more members). We check with 6 processes (members).

## 6.4 Classical Benchmarks

Table 1 consists of classical benchmarks [13], [14], [12] and [11] which are checked for assertion violations under the three models. Since the traces generated differ for each model X ∈ {CC, CCv, CM}, the violations also differ. The benchmarks marked SAFE under a model X ∈ {CC, CCv, CM} had no assertion violation under any execution, while the unsafe ones reported a violation. We consider twenty such examples, each in three different versions varying the number of processes and variables.

For each example, we obtain three versions by parameterizing the number of processes and instructions. In version 1, we have four processes per program and three to five instructions per process. Version 2 is obtained by allowing each process to have seven to ten instructions. Version 3 expands version 2 by allowing each program to have up to five or six processes and up to 15-20 instructions per process. The number of instructions is increased by introducing fresh variables and adding reads/writes on them. Versions 2 and 3 serve as a stress test for Conschecker, as increasing the number of instructions and processes increases the number of consistent traces. Conschecker took less than 3 s to finish running all version 1 programs, about 30 s for all version 2 programs, and about 200 s for all version 3 programs.

## 6.5 Parameterized Benchmarks

Table 4 reports the experimental results of Conschecker on 8 parameterized benchmarks. In redundant-co(N) (taken from [5]), N is the number of loop iterations per process in a program with 3 processes; in all others, the parameterization is on the number of processes. This set of benchmarks serves to check the scalability of Conschecker. As seen in Table 4, Conschecker scales up to 20 processes (n-writers-a-read) and 13 variables (lastzero).

## 7 Conclusion

In this paper, we have provided a DPOR algorithm using the po-rf equivalence for three prominent causal consistency models, and implemented it in a tool, Conschecker. This is the first tool for stateless model checking under causal consistency models. We plan to extend our work by developing a DPOR algorithm for transactional programs under CC, CCv, and CM [12]. There, the extra complication is the presence of transactions, which must be executed atomically without interference in each process. A final step is to handle snapshot isolation, the strongest among transactional consistency models.

## 8 Data-Availability Statement

The tool and experimental data for the study are available at the Zenodo repository: [45].

## References


Austin, TX, USA, January 26-28, 2011, pp. 55-66. ACM (2011). https://doi.org/10.1145/1926385.1926394


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## Symbolic Model Checking for TLA+ Made Faster

Rodrigo Otoni<sup>1</sup>(B) , Igor Konnov<sup>2</sup> , Jure Kukovec<sup>2</sup> , Patrick Eugster<sup>1</sup> , and Natasha Sharygina<sup>1</sup>

> <sup>1</sup> Università della Svizzera italiana, Lugano, Switzerland {otonir,eugstp,sharygin}@usi.ch
> <sup>2</sup> Informal Systems, Vienna, Austria {igor,jure}@informal.systems

Abstract. The need to provide formal guarantees about the behaviour of the algorithms underpinning modern distributed systems became evident in recent years. This interest made apparent the complexities involved in applying verification techniques in a distributed setting, with significant effort being made in both academia and industry to aid in this endeavour. Many formalisms have been proposed to tackle the difficulties faced by practitioners, with one that has seen widespread use in industry being TLA<sup>+</sup>, adopted, for instance, by Amazon Web Services. TLA<sup>+</sup> provides engineers with a way of specifying both systems and desired properties, and is supported by a number of verification tools. Despite their extensive use, such tools suffer considerably from lack of scalability. To solve this, we propose a novel encoding of TLA<sup>+</sup> into SMT constraints to improve symbolic model checking efficiency. Our insight is the need to provide the SMT solver with structural information about the TLA<sup>+</sup> specification encoded, i.e., how data structures and their component elements interact, which we do by relying on the SMT theory of arrays. We implemented our approach by modifying the SMT-based model checker Apalache and evaluated it against comparable tools. Our results show that our approach outperforms existing ones on a number of benchmarks, with an order of magnitude improvement in checking time.

Keywords: Model checking · SMT arrays · Distributed algorithms

## 1 Introduction

Distributed systems are ubiquitous in the modern world, with many companies directly relying on them to conduct business. Due to this, the ability to ensure that a distributed system is operating correctly is paramount. The search for correctness guarantees led to an influx of interested parties adopting formal verification methodologies in recent years. One of the most famous examples of this trend is probably the adoption of TLA<sup>+</sup> [17] by Amazon Web Services [19]. TLA<sup>+</sup> is a specification language based on the temporal logic of actions (TLA) which allows users to describe the expected behaviour of a system, while abstracting away implementation details that do not impact high-level properties, e.g., memory management. With TLA<sup>+</sup> specifications at hand, Amazon engineers rely on model checking for correctness guarantees of systems such as DynamoDB [23].

Despite recent interest and advances, the verification of distributed systems remains notoriously difficult. This is mainly due to the fact that, given their distributed nature, distributed algorithms' executions admit numerous potential interleavings of steps, with state-spaces generally growing exponentially with the number of participants. In the case of TLA<sup>+</sup>, a handful of tools are available to aid in verification [14]. TLC [27] is an explicit-state model checker that enumerates all reachable states of the given system. Apalache [13] is a symbolic bounded model checker that uses a satisfiability modulo theories (SMT) encoding of states in order to better tackle the state-space explosion problem. TLAPS [6] is an interactive proof system that enables the proving of properties without the need of exploring the state-space itself. Despite providing the benefit of verifying specifications with infinite state-spaces, and efforts being made towards partial automation [18], TLAPS adoption is still slow, with engineers favouring the push-button automation provided by model checkers.

In this work we focus on symbolic model checking for TLA<sup>+</sup>, as spearheaded by the SMT encoding which underpins Apalache, but provide insights into SMT-based model checking that may generalise to other contexts. The encoding of TLA<sup>+</sup> into SMT done by Apalache removes all structural information present in the encoded specification, with all TLA<sup>+</sup> data structures being represented via uninterpreted constants in the generated SMT formula. The information not forwarded to the SMT solver has the potential to significantly improve solving efficiency. We propose an alternative SMT encoding that makes full use of the SMT theory of arrays [8] to encode the main TLA<sup>+</sup> data structures, i.e., sets and functions, with the goal of improving solving performance, which is the determining factor in overall model checking performance.

Concretely, we modify Apalache's abstract reduction system (ARS) to generate constraints in the SMT theory of arrays, while relying on its preprocessing infrastructure, as shown in Figure 1. Apalache rewrites the input specification into KerA<sup>+</sup>, a verification-friendly fragment of TLA<sup>+</sup> [13], and then applies ARS rules to generate the SMT formula to be solved. We implemented our encoding in Apalache and compared it with Apalache's constants encoding and TLC. Our experiments indicate that embedding structural information into the SMT formulas has a significant impact on performance. Our contributions are:


The paper is structured as follows: background is given in Section 2, the arrays-based encoding and its evaluation are presented in Sections 3 and 4, related work is discussed in Section 5, and our final remarks are made in Section 6.

Fig. 1: Overview of symbolic model checking for TLA<sup>+</sup>. The dotted box highlights the identification of symbolic transitions from [16] and the rewriting into KerA<sup>+</sup>. The dashed box highlights the encoding based on uninterpreted constants from [13]. The solid box highlights the arrays-based encoding we propose.

## 2 Background

In this section we introduce the basics of TLA<sup>+</sup>, its KerA<sup>+</sup> fragment used to represent TLA<sup>+</sup>'s core, the approach to generate SMT constraints from KerA<sup>+</sup> via abstract reduction, and finally the SMT theory of arrays.

#### 2.1 TLA+

We introduce TLA<sup>+</sup> via a specification of the asynchronous Byzantine agreement protocol by Bracha and Toueg [5], shown in Figure 2. Here we focus on the most relevant TLA<sup>+</sup> constructs, with further details being available in [17].

The first notable aspect of TLA<sup>+</sup> is that specifications may be parametrised, e.g., the number of processes and faults may not be fixed. In our example, the keyword constants, in line 3, is used to declare its parameters: N, the total number of processes, and T and F, the maximal and actual number of faulty processes. It is important to understand, however, that while a specification may be parametrised, model checking can only be carried out for a specific instance of the protocol at a time, e.g., N = 4 and T = F = 1. Parameter declarations are followed by variable declarations, by the use of the variables keyword, in line 4. Variables define the states of the state-machine that the specification describes, with each state being defined by the combination of the values held by each variable. In our example, each state is defined by the values of sentEcho, sentReady, rcvdEcho, rcvdReady, and pc.

The remaining TLA<sup>+</sup> operators describe state-machine transitions or properties to be checked, and are defined using ≜. Two operators are of special significance, one that defines the initial-state predicate and one that plays the role of the transition operator. In our example, these operators are Init, in line 8, and Next, in line 22. Concretely, Init defines the starting point for state-space exploration and Next defines the exploration itself. Transitions are guided by constraints that must hold in both pre-transition states, represented by non-primed variables, and post-transition states, represented by primed variables.

    module ABA
    extends Integers, FiniteSets
    constants N, T, F
    variables sentEcho, sentReady, rcvdEcho, rcvdReady, pc

    Corr ≜ 1 .. (N − F)                      \* The set of correct processes
    Byz ≜ (N − F + 1) .. N                   \* The set of Byzantine processes
    Proc ≜ 1 .. N                            \* The set of all processes

    Init ≜ ∧ pc ∈ [Corr → {"V0", "V1"}]
           ∧ rcvdEcho = [p ∈ Corr ↦ {}]
           ∧ rcvdReady = [p ∈ Corr ↦ {}]
           ∧ sentEcho ∈ subset Byz
           ∧ sentReady ∈ subset Byz

    Receive(p, nextEcho, nextReady) ≜ ...    \* Omitted for brevity
    SendEcho(p, nextEcho, nextReady) ≜ ...   \* Omitted for brevity

    SendReady(p, nextEcho, nextReady) ≜
        ∧ pc[p] = "EC"
        ∧ ∨ Cardinality(nextEcho) ≥ (N + T + 2) ÷ 2
          ∨ Cardinality(nextReady) ≥ T + 1
        ∧ pc′ = [pc except ![p] = "RD"]
        ∧ sentReady′ = sentReady ∪ {p}
        ∧ unchanged sentEcho

    Decide(p, nextReady) ≜
        ∧ pc[p] = "RD"
        ∧ Cardinality(nextReady) ≥ 2 ∗ T + 1
        ∧ pc′ = [pc except ![p] = "AC"]
        ∧ unchanged ⟨sentEcho, sentReady⟩

    Next ≜ ∃ p ∈ Corr, nextEcho ∈ subset sentEcho, nextReady ∈ subset sentReady :
        ∧ Receive(p, nextEcho, nextReady)
        ∧ ∨ SendEcho(p, nextEcho, nextReady)
          ∨ SendReady(p, nextEcho, nextReady)
          ∨ Decide(p, nextReady)
          ∨ unchanged ⟨pc, sentEcho, sentReady⟩

    NoDecide ≜ ∀ p ∈ Corr : pc[p] ≠ "AC"     \* Invariant stating that processes never Decide

Fig. 2: Example of a TLA<sup>+</sup> specification, based on the asynchronous Byzantine agreement protocol by Bracha and Toueg [5]; simplifications made for brevity.

Specifications may optionally define invariants, i.e., properties that should hold in every reachable state. There is no special syntax for invariants, and they are provided by name to model checkers at invocation time. In our example, we have one invariant, NoDecide, in line 26. A specification satisfies NoDecide if no state reachable from Init via any number of Next transitions has pc[p] = "AC", for some p ∈ Corr. Abstractly, this invariant holds if Decide can never be taken.

#### 2.2 KerA+

TLA<sup>+</sup> provides users with a myriad of ways of specifying systems. This richness, although being one of its strengths, adds significant difficulty to the generation of SMT constraints. To overcome this challenge, TLA<sup>+</sup> specifications are rewritten into a more compact language, KerA<sup>+</sup>, before being checked. From KerA<sup>+</sup>, the ARS can generate SMT constraints in a simpler and provably sound way.

The KerA<sup>+</sup> language consists of a small subset of TLA<sup>+</sup> conjoined with four additional constructs not originating from TLA<sup>+</sup>, and is able to express almost all TLA<sup>+</sup> expressions. It contains constructs for the manipulation of sets,

Fig. 3: Illustration of three arenas. The captions describe the modelled elements with the overapproximation c1 = 5, c2 = 6, c3 = 7, c4 = {5, 6}, c5 = {6, 7}, and c6 = {{5, 6}, {6, 7}}. Note that the concrete value of a cell can be given by any of the possible subtrees having said cell as a root, e.g., for c6 we have that ∃ c4 ∈ P({5, 6}), c5 ∈ P({6, 7}) . c6 ∈ P({c4, c5}); P stands for power set.

functions, records, tuples, and sequences, as well as integer arithmetic operators, Boolean and integer literals, and constants, with all data structures having a bounded size. The semantics of KerA<sup>+</sup> derive directly from the TLA<sup>+</sup> constructs it uses, with the non-TLA<sup>+</sup> constructs, which help simplify the rewriting system, having simple control semantics. The correctness of the rewriting itself is guaranteed by construction. One example is the rewriting of S ∩ T into the set comprehension {x ∈ S : x ∈ T}. Further KerA<sup>+</sup> details are available in [13].
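The comprehension pattern can be checked concretely: a filter of this shape denotes exactly the intersection of the two sets, as a quick sanity check (in Python, purely for illustration) confirms.

```python
# The comprehension {x ∈ S : x ∈ T}, written as a Python set
# comprehension, coincides with the intersection S ∩ T.
S = {1, 2, 3}
T = {2, 3, 4}
rewritten = {x for x in S if x in T}
```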

### 2.3 Abstract Reduction System

In order to verify a specification in KerA<sup>+</sup>, we generate an SMT formula that is equisatisfiable to it. To do so, we use an abstract reduction system (ARS) which iteratively applies reduction rules that transform KerA<sup>+</sup> expressions into SMT constraints. The core of the ARS is the arena, a graph structure that overapproximates the specification's data structures and guides rule application. The rules collapse KerA<sup>+</sup> expressions into cells, which represent the symbolic evaluation of these expressions, with the cells then being used as vertices in the arena. The arena edges represent the data-structure overapproximation, e.g., a cell representing a set will have directed edges to the cells representing all its potential elements, as illustrated in Figure 3. The reduction process terminates when the initial KerA<sup>+</sup> expression e is collapsed into a single cell c, producing an SMT formula Φ in the process, such that c ∧ Φ is equisatisfiable to e; equisatisfiability relies on the boundedness of the data structures and is detailed in Section 3.3. The satisfiability of e can then be checked by forwarding c ∧ Φ to an SMT solver.

Formally, the ARS is defined as (S, ⇝), with S being the set of ARS states and ⇝ ⊆ S × S being the transition relation. A state (e, A, ν, Φ) ∈ S is a four-tuple containing a KerA<sup>+</sup> expression e, an arena A, a binding ν of names to cells, and a first-order formula Φ. The elements of ARS states contain a number of cells, which are first-order terms annotated with a type τ. Cells of type Bool and Int are interpreted in SMT as Booleans and integers, while cells of the remaining types are encoded as uninterpreted constants in the constants encoding; the arrays encoding approach is discussed in Section 3. Cells are referred to via the notation c_name or c_index, and they can be seen as both KerA<sup>+</sup> constants and first-order terms in SMT. An arena is a directed acyclic graph A = (V, E), with V being a finite set of cells and E ⊆ V × (1..|V|) × V being a set of relations between the cells in V. Every relation between cells is represented by an arena edge of the form (ca, i, cb), also written ca −i→ cb, with no duplicates, i.e., for every pair (ca1, i1, cb1), (ca2, i2, cb2) ∈ E we have that ca1 = ca2 ∧ cb1 ≠ cb2 implies i1 ≠ i2, and no gaps in the relation indices, i.e., for every edge (ca, i, cb) and index j ∈ 1..(i − 1) we have that ∃ cc ∈ V . (ca, j, cc) ∈ E. A binding is a partial function from KerA<sup>+</sup> variables to the cells V of A, i.e., a mapping from variables to cells. Finally, Φ is a formula in the SMT fragment supported by the ARS and the target SMT solver, e.g., the quantifier-free uninterpreted functions and non-linear arithmetic (QF_UFNIA) fragment supported by the constants encoding.
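The indexed-edge conditions (no duplicates, no gaps) can be sketched concretely; the class below and its storage layout are our own illustration, not Apalache's implementation. Keeping the targets of each cell in an ordered list yields 1-based edge indices with no gaps, and each index identifies exactly one target.

```python
# Sketch of an arena's indexed edges: cells are vertices, and each edge
# (c_a, i, c_b) is recovered from the position of c_b in c_a's target list.
class Arena:
    def __init__(self):
        self.targets = {}  # cell -> ordered list of target cells

    def add_edge(self, src, dst):
        # the new edge receives the next free index for src (no gaps)
        self.targets.setdefault(src, []).append(dst)

    def edge(self, src, i):
        # returns c_b for the edge (c_a, i, c_b), with 1-based index i
        return self.targets[src][i - 1]

arena = Arena()
arena.add_edge("c6", "c4")  # the set cell c6 may contain c4 ...
arena.add_edge("c6", "c5")  # ... and c5, as in Figure 3
```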

A series of n reduction steps has the form s0 ⇝ ... ⇝ sn, with each step generating state si+1 from state si, 0 ≤ i < n, by applying a reduction rule. The initial state s0 = (e0, A0, ν0, Φ0) has e0 as the initial KerA<sup>+</sup> specification, A0 = (∅, ∅), ν0 containing no mappings, and Φ0 = true. The reduction steps end upon reaching a state sn = (en, An, νn, Φn), with en being a single cell c ∈ Vn and An = (Vn, En). Below we give two examples of rules.

Integer literal reduction. One of the simplest rules has an integer literal num being rewritten into a cell cnum. This cell is added to the arena and a constraint equating cnum to the literal is conjoined with Φ; we use vertical lines to separate state elements, and commas to indicate additions to A and conjunctions to Φ.

$$\frac{\left<num\mid\mathcal{A}\mid\nu\mid\Phi\right>\qquad num\text{ is one of }0,1,-1,\dots}{\left<\mathsf{c}\_{num}\mid\mathcal{A},\mathsf{c}\_{num}:\textsf{Int}\mid\nu\mid\Phi,\mathsf{c}\_{num}=num\right>}\text{ (Int)}$$

The descriptions of rules can be given as inferences, with the premises above the bar and the resulting state below it. Inferences, although reasonable for expressing rules such as Int, are not suitable to give the intuition about how more complex rules work. In light of this, we will use a simplified notation moving forward. We inline inferences as ↣ and omit nonessential information, e.g., propagated values. Below we see rule Int in this simplified format. Note that only the A and Φ updates are shown, without propagating them, and that ν is omitted.

$$\begin{array}{c} num: \textsf{Int} \\ num \text{ is one of } 0, 1, -1, \dots \end{array} \longmapsto \mathsf{c}\_{num} \mid \mathsf{c}\_{num}: \textsf{Int} \mid \mathsf{c}\_{num} = num \tag{\textsf{Int}}$$
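Rule Int can be read operationally; the sketch below is our own executable rendering under stated assumptions (the arena is a dict from cells to types and Φ a list of constraints), not Apalache's code: the literal is collapsed into a fresh cell of type Int, and the constraint c_num = num is conjoined with Φ.

```python
# Executable sketch of rule Int: collapse an integer literal into a
# fresh cell, record its type in the arena, and conjoin c_num = num.
def rule_int(num, arena, phi):
    cell = f"c_{len(arena)}"          # fresh cell name
    arena = {**arena, cell: "Int"}    # add the cell to the arena, typed Int
    phi = phi + [(cell, "=", num)]    # conjoin c_num = num with phi
    return cell, arena, phi

cell, arena, phi = rule_int(5, {}, [])
```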

Picking. To pick a cell out of n cells we use an oracle θ, as per rule FromBasic. In addition to the FROM ... BY θ expression, this rule requires that all pickable cells are of the same basic type τ, e.g., Int. The resulting state has a new cell cpick, which is equated to one of the n cells if 1 ≤ θ ≤ n and is unconstrained otherwise. Picking among cells representing data structures, e.g., sets, can be done via a more general version of rule FromBasic, which we omit for brevity.

$$\begin{array}{c} \mathsf{FROM } \mathsf{c}\_{1}, ..., \mathsf{c}\_{n} \ \mathsf{BY} \; \theta : \tau\\ \tau \text{ is basic and } \mathsf{c}\_{1} : \tau, ..., \mathsf{c}\_{n} : \tau \end{array} \longmapsto \mathsf{c}\_{pick} \mid \mathsf{c}\_{pick} : \tau \mid \bigwedge\_{1 \le i \le n} (\theta = i \to \mathsf{c}\_{pick} = \mathsf{c}\_{i}) \tag{\textsf{FromBasic}}$$
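As a concrete illustration, the conjunction $\bigwedge\_{1 \le i \le n} (\theta = i \to \mathsf{c}\_{pick} = \mathsf{c}\_{i})$ can be evaluated on candidate assignments. The following Python sketch is our own illustration, not part of Apalache; the function name and the concrete cell values are hypothetical.

```python
# Illustrative sketch: evaluate the FromBasic constraints for a concrete
# oracle value theta and a candidate value for c_pick.
def pick_constraints(theta, cells, c_pick):
    # conjunction over 1 <= i <= n of (theta = i -> c_pick = c_i)
    return all(theta != i or c_pick == cells[i - 1]
               for i in range(1, len(cells) + 1))

cells = [10, 20, 30]
assert pick_constraints(2, cells, 20)      # theta = 2 forces c_pick = c_2
assert not pick_constraints(2, cells, 30)  # violates theta = 2 -> c_pick = c_2
assert pick_constraints(0, cells, 99)      # theta out of range: unconstrained
```

When θ falls outside 1..n, no implication fires, so any value for c<sub>pick</sub> satisfies the constraints, matching the "unconstrained otherwise" behaviour of the rule.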

#### 2.4 SMT Theory of Arrays

The theory of arrays provides a natural way to encode data structures and is thus a prime candidate as an encoding target for TLA<sup>+</sup> constructs. Here we present the theory's operators relevant for our work; further details can be found in [8].

Given the set of sorts S, containing one sort $\mathbf{s}\_\tau$ for each type τ in KerA<sup>+</sup>, an array sort $\mathbf{s}\_{\tau\_1,\tau\_2}$ has the form $\mathbf{s}\_{\tau\_1} \Rightarrow \mathbf{s}\_{\tau\_2}$, with $\mathbf{s}\_{\tau\_1} \in S$ being its index sort and $\mathbf{s}\_{\tau\_2} \in S$ being its value sort. Each array sort is supported by two basic operators: $select : (\mathbf{s}\_{\tau\_1} \Rightarrow \mathbf{s}\_{\tau\_2}, \mathbf{s}\_{\tau\_1}) \rightarrow \mathbf{s}\_{\tau\_2}$, which handles array access at a given index, and $store : (\mathbf{s}\_{\tau\_1} \Rightarrow \mathbf{s}\_{\tau\_2}, \mathbf{s}\_{\tau\_1}, \mathbf{s}\_{\tau\_2}) \rightarrow \mathbf{s}\_{\tau\_1} \Rightarrow \mathbf{s}\_{\tau\_2}$, which updates an array at a given index with a given value. For brevity, we write $select(a, i)$ as $a[i]$ in the remainder of the manuscript. Regarding equality between arrays, different interpretations are possible. We use arrays with extensionality [25], under which two arrays are equal if they contain the same values at the same entries. Extensionality is formally defined as $\forall\, a, b : \mathbf{s}\_{\tau\_1} \Rightarrow \mathbf{s}\_{\tau\_2}\, .\; a = b \lor \exists\, i : \mathbf{s}\_{\tau\_1}\, .\; a[i] \ne b[i]$. For access and update, consistency is ensured by the following property:

$$\forall \ a: \mathbf{s}\_{\tau\_1} \Rightarrow \mathbf{s}\_{\tau\_2}, \ i: \mathbf{s}\_{\tau\_1}, \ j: \mathbf{s}\_{\tau\_1}, \ v: \mathbf{s}\_{\tau\_2} \ . \ \underbrace{store(a, i, v)[i] = v}\_{\textit{access consistency}} \land \underbrace{\left( i = j \lor store(a, i, v)[j] = a[j] \right)}\_{\textit{update consistency}}$$
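These two properties can be sanity-checked on concrete data. The Python sketch below is our own illustration, not how a solver represents arrays: an SMT array is modeled as an immutable pair of an index-to-value dict and a default value.

```python
# Arrays modeled as (values, default): select reads, store copies and updates.
def select(a, i):
    values, default = a
    return values.get(i, default)

def store(a, i, v):
    values, default = a
    return ({**values, i: v}, default)

a = ({0: 10, 1: 20}, 0)  # an array with default value 0

# access consistency: store(a, i, v)[i] = v
assert select(store(a, 5, 99), 5) == 99

# update consistency: i = j or store(a, i, v)[j] = a[j]
for j in (0, 1, 7):
    assert 5 == j or select(store(a, 5, 99), j) == select(a, j)
```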

In addition to select and store, the theory of arrays can be extended with other operators, two of which are map<sub>f</sub> and K<sub>s<sub>τ</sub></sub>, whose signatures are shown below. The map<sub>f</sub> operator applies an n-ary function $f : (\mathbf{s}\_{\tau\_1}, \dots, \mathbf{s}\_{\tau\_n}) \rightarrow \mathbf{s}\_{\tau\_f}$ to the values stored at each index of its array arguments, producing a new array whose values are the results of the function application, i.e., map<sub>f</sub> is the pointwise array extension of f. The K<sub>s<sub>τ</sub></sub> operator produces a constant array, all of whose values are the constant provided as argument. The properties defining the behaviour of these two operators are shown after their signatures.

$$map\_f: (\mathbf{s}\_\tau \Rightarrow \mathbf{s}\_{\tau\_1}, \dots, \mathbf{s}\_\tau \Rightarrow \mathbf{s}\_{\tau\_n}) \rightarrow \mathbf{s}\_\tau \Rightarrow \mathbf{s}\_{\tau\_f} \qquad K\_{\mathbf{s}\_\tau}: \mathbf{s}\_{\tau\_{cont}} \rightarrow \mathbf{s}\_\tau \Rightarrow \mathbf{s}\_{\tau\_{cont}}$$

$$\forall \; a\_1: \mathbf{s}\_\tau \Rightarrow \mathbf{s}\_{\tau\_1}, \; \dots, \; a\_n: \mathbf{s}\_\tau \Rightarrow \mathbf{s}\_{\tau\_n}, \; i: \mathbf{s}\_\tau \; . \; map\_f(a\_1, \dots, a\_n)[i] = f(a\_1[i], \dots, a\_n[i])$$

$$\forall \; i: \mathbf{s}\_{\tau\_1}, \; v: \mathbf{s}\_{\tau\_2} \; . \; K\_{\mathbf{s}\_{\tau\_1}}(v)[i] = v$$
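The two defining properties above can likewise be checked on a small executable model. In the Python sketch below (our own illustration; map_f and K are modeled over (dict, default) pairs, with the default tracking the value at all unlisted indexes):

```python
def select(a, i):
    values, default = a
    return values.get(i, default)

def K(v):
    # constant array: K(v)[i] = v for every index i
    return ({}, v)

def map_f(f, *arrays):
    # pointwise extension: map_f(a1, ..., an)[i] = f(a1[i], ..., an[i])
    idxs = set().union(*(a[0] for a in arrays))
    values = {i: f(*(select(a, i) for a in arrays)) for i in idxs}
    return (values, f(*(a[1] for a in arrays)))

assert select(K(7), 123) == 7  # K property at an arbitrary index

# map_f as pointwise conjunction of two Boolean arrays
a = ({1: True, 2: False}, False)
b = ({1: True, 3: True}, False)
m = map_f(lambda x, y: x and y, a, b)
assert select(m, 1) and not select(m, 2) and not select(m, 3)
```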

The select and store operators are part of the theory of arrays with extensionality defined in version 2.6 of the SMT-LIB standard [3]. Other operators are provided on a solver-by-solver basis, e.g., Z3 [7] supports both map<sub>f</sub> and K<sub>s<sub>τ</sub></sub>, while CVC5 [2] supports K<sub>s<sub>τ</sub></sub>; future SMT-LIB updates may add them to the standard.

## 3 Encoding TLA+ using Arrays

Our goal is to encode TLA<sup>+</sup> data structures in a structure-preserving way. To do this, we use arrays to represent the main components of TLA<sup>+</sup>, sets and functions, as SMT constraints. We follow the ARS structure described in Section 2.3, but update the reduction rules handling sets and functions. The remaining TLA<sup>+</sup> constructs, e.g., tuples, are represented as per the constants encoding.

The arrays encoding yields two efficiency benefits: easy access to data structures and the possibility of using SMT equality. The first benefit stems from the use of SMT select, which allows us to check a stored value with a single constraint, in contrast to the constants encoding, whose number of constraints is linear in the size of the data structures' overapproximation. The second benefit affects the comparison of data structures, which the arrays encoding performs via a single SMT equality for sets and functions, since these structures are represented by a single SMT term, while the constants encoding requires a number of constraints that is quadratic in the size of the data structures' overapproximation. A summary is given in Table 1. We first describe how to encode sets and functions, and then present the correctness argument for the reduction to arrays.

#### 3.1 Encoding TLA+ Sets using Arrays

We use arrays to encode TLA<sup>+</sup> sets as characteristic functions, i.e., a set of type τ is represented by an array of sort $\mathbf{s}\_\tau \Rightarrow \mathsf{Bool}$. Set membership is encoded by storing true or false at a given array index. The reduction rules used to handle the main set operators are presented below.

Set Enumeration. The simplest way to create a set is to enumerate its elements. Rule Enum reduces an explicit set of cells to a fresh cell c<sub>set</sub>, whose edges link it to its elements; c<sub>set</sub>→c<sub>1</sub>, ..., c<sub>n</sub> is shorthand for c<sub>set</sub> $\xrightarrow{1}$ c<sub>1</sub>, ..., c<sub>set</sub> $\xrightarrow{n}$ c<sub>n</sub>. There is no guarantee that the enumerated elements are unique, thus the arena may contain edges to repeated elements.

$$\{\mathsf{c}\_1, \ldots, \mathsf{c}\_n\} : \mathsf{Set}[\tau] \longmapsto \mathsf{c}\_{set} \mid \mathsf{c}\_{set} : \mathsf{Set}[\tau], \mathsf{c}\_{set} \to \mathsf{c}\_1, \ldots, \mathsf{c}\_n \mid EnumCtr \quad (\mathtt{Enum})$$

The EnumCtr constraints added by the arrays encoding create an empty set, by using a constant array with value false, ⊥, and update the array by storing true, ⊤, at the appropriate indexes. The array resulting from the last update, $a^{n}\_{\mathsf{c}\_{set}}$, is then equated to c<sub>set</sub>. Since cells representing repeated elements lead to updates of the same index, we encode standard sets, in contrast to the constants encoding, which encodes multisets due to the arena imprecision; multisets lead to multiple constraints being generated to encode membership of a single element.

$$\underbrace{a\_{\mathsf{c}\_{set}}^{0} = K\_{\tau}(\bot)}\_{\text{empty set}} \land \underbrace{\bigwedge\_{1 \le i \le n} a\_{\mathsf{c}\_{set}}^{i} = store(a\_{\mathsf{c}\_{set}}^{i-1}, \mathsf{c}\_{i}, \top)}\_{\text{set updates}} \land \underbrace{\mathsf{c}\_{set} = a\_{\mathsf{c}\_{set}}^{n}}\_{\text{cell equality}} \qquad (EnumCtr)$$

Although the number of constraints generated by the arrays encoding to model set enumeration equals that of the constants encoding, it has the benefit of producing a defined interpretation for c<sub>set</sub>, the array $a^{n}\_{\mathsf{c}\_{set}}$, which is absent in the constants encoding. This has a significant impact on set membership and cell equality, as described below.
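To illustrate EnumCtr, the Python sketch below (our own model of arrays as (dict, default) pairs, not Apalache code) builds the chain $a^{0}, \dots, a^{n}$ by folding store over the enumerated cells, starting from a constant-false array:

```python
def select(a, i):
    return a[0].get(i, a[1])

def store(a, i, v):
    return ({**a[0], i: v}, a[1])

def K(v):
    return ({}, v)

def enum_set(cells):
    a = K(False)              # a0 = K(false): the empty set
    for c in cells:           # a_i = store(a_{i-1}, c_i, true)
        a = store(a, c, True)
    return a                  # c_set is equated to a_n

s = enum_set([1, 2, 2, 3])    # {1, 2, 3}: the repeated 2 overwrites index 2
assert select(s, 2) is True   # membership check is a single select
assert select(s, 4) is False
```

A repeated element simply overwrites an index already set to true, so a standard set is obtained, and a subsequent membership check is a single select.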

Set Membership. Checking a membership relation c<sub>x</sub> ∈ c<sub>set</sub>, given the presence of the arena edges c<sub>set</sub>→c<sub>1</sub>, ..., c<sub>n</sub> and 1 ≤ x ≤ n, is straightforward. A single fresh cell of Boolean type is introduced and equated to c<sub>set</sub>[c<sub>x</sub>].

Cell Equality. The constraints generated by encoding set membership and many other constructs assume that cells can be compared. When this is not directly the case, the equalities are cached in preparation. For example, if a set of n tuples c<sub>t</sub> of size two is being equated, the constraints $\mathsf{c}\_{t\_i} = \mathsf{c}\_{t\_j} \leftrightarrow \mathsf{c}^{1}\_{t\_i} = \mathsf{c}^{1}\_{t\_j} \land \mathsf{c}^{2}\_{t\_i} = \mathsf{c}^{2}\_{t\_j}$, with 1 ≤ i ≤ n and 1 ≤ j ≤ n, are added to Φ; here we use $\mathsf{c}^{1}\_{t}$ and $\mathsf{c}^{2}\_{t}$ to represent the values of the 2-tuple. The need for this caching of equalities only arises when data structures encoded as uninterpreted constants are compared. For the remaining rules we assume that caching was done, if needed, and cells can be compared via direct equality.

Table 1: Number of constraints generated by each SMT encoding to model the main TLA<sup>+</sup> constructs.

Set Filter. In TLA<sup>+</sup>, the elements of a set S can be filtered by a predicate p via the expression {x ∈ S : p}. This expression creates a set F which contains only the elements of S that satisfy p, e.g., {x ∈ {−1, 0, 1} : x ≥ 0} = {0, 1}. Rule Filter reduces a filter to a new set cell, c<sub>F</sub>, whose arena overapproximation contains the elements of S, but whose constraints ensure that only filtered elements are members of F; p[y/x] means that x is replaced by y in p, and parentheses indicate the application of another rule, in this case the predicate resolution rule.

$$\begin{array}{c} \{x \in \mathsf{c}\_{S} : p\} : \mathsf{Set}[\tau] \text{ and } \mathsf{c}\_{S} \to \mathsf{c}\_{1}, \dots, \mathsf{c}\_{n} \\ \qquad \longmapsto \left(p[\mathsf{c}\_{1}/x] : \mathsf{Bool}, \dots, p[\mathsf{c}\_{n}/x] : \mathsf{Bool} \longmapsto \mathsf{c}\_{1}^{p}, \dots, \mathsf{c}\_{n}^{p}\right) \\ \qquad \longmapsto \mathsf{c}\_{F} \mid \mathsf{c}\_{F} : \mathsf{Set}[\tau], \mathsf{c}\_{F} \to \mathsf{c}\_{1}, \dots, \mathsf{c}\_{n} \mid FilterCtr \end{array} \tag{\mathsf{Filter}}$$

The constraints added use an array $a^{0}\_{\mathsf{c}\_{F}}$ that is initially unconstrained, i.e., the values mapped by all the indexes of $a^{0}\_{\mathsf{c}\_{F}}$ are unconstrained, as opposed to $a^{0}\_{\mathsf{c}\_{set}}$ in EnumCtr. The values of $a^{0}\_{\mathsf{c}\_{F}}$ mapped by indexes c<sub>1</sub>, ..., c<sub>n</sub> are constrained by $\mathsf{c}^{p}\_{1}, \dots, \mathsf{c}^{p}\_{n}$ via array access, i.e., $a^{0}\_{\mathsf{c}\_{F}}[\mathsf{c}\_{i}]$ is asserted to be true or false based on $\mathsf{c}^{p}\_{i}$, with 1 ≤ i ≤ n. We then apply pointwise conjunction to c<sub>S</sub> and $a^{0}\_{\mathsf{c}\_{F}}$ via the map<sub>f</sub> SMT operator; we go from $a^{0}\_{\mathsf{c}\_{F}}$ to $a^{n}\_{\mathsf{c}\_{F}}$ to keep the array index in step with the arena overapproximation. Indexes whose values were false in S remain so in F, and indexes whose values were true in S store the filter's predicate evaluation.

$$\underbrace{\bigwedge\_{1 \le i \le n} ite\left(\mathsf{c}^{p}\_{i},\ a^{0}\_{\mathsf{c}\_{F}}[\mathsf{c}\_{i}],\ \lnot a^{0}\_{\mathsf{c}\_{F}}[\mathsf{c}\_{i}]\right)}\_{\text{predicate-based constraining}} \land \underbrace{a^{n}\_{\mathsf{c}\_{F}} = map\_{\land}(\mathsf{c}\_{S}, a^{0}\_{\mathsf{c}\_{F}})}\_{\text{pointwise conjunction}} \land \underbrace{\mathsf{c}\_{F} = a^{n}\_{\mathsf{c}\_{F}}}\_{\text{cell equality}} \qquad (FilterCtr)$$

Both encodings generate a linear number of constraints, since n predicates p[c<sub>i</sub>/x] have to be considered. Unlike EnumCtr, FilterCtr does not contain a chain of store operations, due to the use of map<sub>f</sub>. This avoids the need to create intermediary arrays, and is not possible in EnumCtr due to its constant array.
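The pointwise-conjunction step of FilterCtr can be made concrete. In the Python sketch below (our own illustration over arrays modeled as (dict, default) pairs; map_and plays the role of map<sub>∧</sub>), the predicate array holds the evaluations of p at the arena cells:

```python
def select(a, i):
    return a[0].get(i, a[1])

def map_and(a, b):
    # pointwise conjunction: the role played by map_f with f = and
    idxs = set(a[0]) | set(b[0])
    return ({i: select(a, i) and select(b, i) for i in idxs},
            a[1] and b[1])

S = ({-1: True, 0: True, 1: True}, False)        # the set {-1, 0, 1}
pred = ({i: i >= 0 for i in (-1, 0, 1)}, False)  # p[c_i/x] at each cell
F = map_and(S, pred)                             # a^n_cF = map_and(S, a^0_cF)

assert not select(F, -1) and select(F, 0) and select(F, 1)
assert not select(F, 7)   # false in S stays false in F
```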

Set Map. The expression {e : x ∈ S} can be used to construct a set M from a set S, with the elements of M being e[y/x] for y ∈ S. For example, the expression {x ÷ 5 : x ∈ {4, 5, 6}} yields the set {0, 1}, with ÷ denoting standard integer division. To reduce set map we use rule Map.

$$\begin{array}{c} \{e : x \in \mathsf{c}\_{S}\} : \mathsf{Set}[\tau] \text{ and } \mathsf{c}\_{S} \to \mathsf{c}\_{1}, \dots, \mathsf{c}\_{n} \\ \qquad \longmapsto \left(e[\mathsf{c}\_{1}/x] : \tau, \dots, e[\mathsf{c}\_{n}/x] : \tau \longmapsto \mathsf{c}\_{1}^{e}, \dots, \mathsf{c}\_{n}^{e}\right) \\ \qquad \longmapsto \mathsf{c}\_{M} \mid \mathsf{c}\_{M} : \mathsf{Set}[\tau], \mathsf{c}\_{M} \to \mathsf{c}\_{1}^{e}, \dots, \mathsf{c}\_{n}^{e} \mid MapCtr \end{array} \tag{\mathsf{Map}}$$

The constraints added in rule Map are similar to those added in rule Enum. The difference between them is that set enumeration precisely defines the elements to be added to the new set cell, while set map is based on an existing set cell, which is a set overapproximation. Due to this, membership in M has to be guarded by membership in S, leading to a linear number of constraints being generated.

$$\underbrace{a^{0}\_{\mathsf{c}\_{M}} = K\_{\tau}(\bot)}\_{\text{empty set}} \land \underbrace{\bigwedge\_{1 \le i \le n} ite\left(\mathsf{c}\_{S}[\mathsf{c}\_{i}],\ a^{i}\_{\mathsf{c}\_{M}} = store(a^{i-1}\_{\mathsf{c}\_{M}}, \mathsf{c}^{e}\_{i}, \top),\ a^{i}\_{\mathsf{c}\_{M}} = a^{i-1}\_{\mathsf{c}\_{M}}\right)}\_{\text{guarded set updates}} \land \underbrace{\mathsf{c}\_{M} = a^{n}\_{\mathsf{c}\_{M}}}\_{\text{cell equality}} \quad (MapCtr)$$
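The guarded updates can be sketched executably (our own illustrative Python model of arrays as (dict, default) pairs; the guard mirrors the ite on membership in S, here evaluated concretely rather than symbolically):

```python
def select(a, i):
    return a[0].get(i, a[1])

def store(a, i, v):
    return ({**a[0], i: v}, a[1])

def set_map(e, cells, S):
    a = ({}, False)               # a0 = K(false)
    for c in cells:               # arena overapproximation of S
        if select(S, c):          # guard: membership of c in S
            a = store(a, e(c), True)
    return a

S = ({4: True, 5: True, 6: True}, False)
M = set_map(lambda x: x // 5, [4, 5, 6, 7], S)  # 7 is arena noise outside S

assert select(M, 0) and select(M, 1)  # {x ÷ 5 : x ∈ {4, 5, 6}} = {0, 1}
assert not select(M, 2)
```

The cell 7 stands for an arena edge whose element is not actually in S; the guard keeps its image out of M.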

#### 3.2 Encoding TLA+ Functions using Arrays

We use arrays to encode TLA<sup>+</sup> functions directly as functions themselves. To do this, arrays are used in their general format, with a function $f : \mathbf{s}\_{\tau\_1} \rightarrow \mathbf{s}\_{\tau\_2}$ being encoded as an array of sort $\mathbf{s}\_{\tau\_1} \Rightarrow \mathbf{s}\_{\tau\_2}$. Since functions with a finite domain can rely on infinite sorts, e.g., the integer numbers, the encoding of each function also includes constraints defining its domain set, by means of the rules described in the previous section; the result of applying a function to a value outside its domain is undefined in TLA<sup>+</sup>. This approach allows us to generate SMT constraints that follow directly from TLA<sup>+</sup>, making the encoding not only more efficient, but also more natural to describe. In contrast, the constants encoding represents functions explicitly as sets of pairs of the form {⟨x, f(x)⟩ : x ∈ DOMAIN f}. Due to this, its function manipulation relies on set manipulation, e.g., function comparison is encoded as set comparison, leading to a quadratic number of constraints. The reduction rules used to handle functions are presented below.

Function Definition. The definition of a function in TLA<sup>+</sup> is an expression of the form [x ∈ S ↦ e], which maps every domain value v to the expression e[v/x]. This definition is similar to that of set map {e : x ∈ S}, and thus generates constraints in a similar fashion to rule Map. The main difference is that the evaluations of the expression e[v/x] are stored as array values, rather than array indexes, i.e., function definition uses store(a, v, e[v/x]) and set map uses store(a, e[v/x], ⊤), with v being a value in the function's domain or in the set being mapped. Every encoded function has a single argument, with multiple arguments being rewritten as tuples in preprocessing.
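The contrast between store(a, v, e[v/x]) for functions and store(a, e[v/x], ⊤) for set map can be made concrete. The Python sketch below is our own illustration (define_fun and the (dict, default) array model are hypothetical names, not Apalache's); it builds a function array together with its characteristic-function domain:

```python
def select(a, i):
    return a[0].get(i, a[1])

def store(a, i, v):
    return ({**a[0], i: v}, a[1])

def define_fun(e, dom_cells):
    f = ({}, None)                 # values outside the domain stay undefined
    for v in dom_cells:
        f = store(f, v, e(v))      # store(a, v, e[v/x]): v is the *index*
    dom = ({v: True for v in dom_cells}, False)
    return f, dom                  # the domain set travels with the function

f, dom = define_fun(lambda x: x * x, [1, 2, 3])
assert select(f, 2) == 4 and select(dom, 2)
assert not select(dom, 9)          # applying f at 9 would be undefined
```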

Unlike a set cell, a function cell c<sub>F</sub> in the arena does not directly point to its values; the arrays encoding adds two edges to c<sub>F</sub>, c<sub>F</sub> $\xrightarrow{1}$ c<sub>F<sub>dom</sub></sub> and c<sub>F</sub> $\xrightarrow{2}$ c<sub>F<sub>pairs</sub></sub>. Cell c<sub>F<sub>dom</sub></sub> represents the function's domain and cell c<sub>F<sub>pairs</sub></sub> represents the set of pairs {⟨x, f(x)⟩ : x ∈ DOMAIN f}. Cell c<sub>F<sub>pairs</sub></sub>, despite being in the arena, has no SMT constraints modelling it in the arrays encoding; its sole purpose is to help propagate the arena edges of the function's codomain elements.

Function Domain. Accessing a function's domain is trivial in the arrays encoding, since the domain set is generated during function defnition. This results in a simple access to the array representing the domain.

Function Update. Updating a TLA<sup>+</sup> function f consists of changing the result of applying f to an argument arg, f[arg], to a given value v, via the expression [f EXCEPT ![arg] = v]. The update produces a new function g which is identical to f, except that g[arg] = v if arg ∈ DOMAIN f. The arrays encoding generates a single array update constraint in this case.
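A sketch of the update (our own illustration, modeling arrays as (dict, default) pairs; except_update is a hypothetical name): the EXCEPT expression becomes one store, effective only when arg is in the domain.

```python
def select(a, i):
    return a[0].get(i, a[1])

def store(a, i, v):
    return ({**a[0], i: v}, a[1])

def except_update(f, dom, arg, v):
    # [f EXCEPT ![arg] = v]: a single store, guarded by domain membership
    return store(f, arg, v) if select(dom, arg) else f

f = ({1: 10, 2: 20}, None)
dom = ({1: True, 2: True}, False)

g = except_update(f, dom, 2, 99)
assert select(g, 2) == 99 and select(g, 1) == 10
assert except_update(f, dom, 5, 0) == f   # arg outside the domain: unchanged
```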

Function Application. The application of a function to an argument arg is conceptually simple, but quite intricate to realize, as can be seen in rule FunApp. The arrays encoding uses an oracle to check that c<sub>arg</sub> is in the domain and to gather the arena edges of c<sub>res</sub>. The FunAppCtr constraints ensure that the oracle chooses the correct index and equate the result cell to an array access on c<sub>F</sub>. Note that the value of c<sub>res</sub> comes directly from the function application expression itself, with the oracle only being needed to gather the arena edges of c<sub>res</sub>, if m > 0, via c<sup>p</sup>. The need for an oracle is restricted to functions whose codomain contains structured data, e.g., f : Int → Set[Int]. If this is not the case, e.g., g : Int → Int, rule FunApp is simplified and FunAppCtr becomes c<sub>res</sub> = c<sub>F</sub>[c<sub>arg</sub>].

$$\begin{array}{c} \mathsf{c}\_{F}[\mathsf{c}\_{arg}] : \tau \text{ and } \mathsf{c}\_{F} \xrightarrow{1} \mathsf{c}\_{F\_{dom}} \to \mathsf{c}\_{1}^{d}, \dots, \mathsf{c}\_{n}^{d} \text{ and } \mathsf{c}\_{F} \xrightarrow{2} \mathsf{c}\_{F\_{pairs}} \to \mathsf{c}\_{1}^{p}, \dots, \mathsf{c}\_{n}^{p} \\ \longmapsto \left( \mathsf{FROM}\ \mathsf{c}\_{1}^{p}, \dots, \mathsf{c}\_{n}^{p}\ \mathsf{BY}\ \theta : \langle \tau\_{arg}, \tau \rangle \mid \theta : \mathsf{Int} \mid 0 \le \theta \le n \longmapsto \mathsf{c}^{p} \right) \text{ and } \mathsf{c}^{p}[2] \to \mathsf{c}\_{1}, \dots, \mathsf{c}\_{m} \\ \longmapsto \mathsf{c}\_{res} \mid \mathsf{c}\_{res} : \tau,\ \mathsf{c}\_{res} \to \mathsf{c}\_{1}, \dots, \mathsf{c}\_{m} \mid FunAppCtr \end{array} \tag{\mathsf{FunApp}}$$

$$\underbrace{\bigwedge\_{1 \le i \le n} \left(\theta = i \to \mathsf{c}\_{arg} = \mathsf{c}\_{i}^{d} \land \mathsf{c}\_{F\_{dom}}[\mathsf{c}\_{i}^{d}]\right)}\_{\text{oracle constraining}} \land \underbrace{\mathsf{c}\_{res} = \mathsf{c}\_{F}[\mathsf{c}\_{arg}]}\_{\text{cell equality}} \qquad (FunAppCtr)$$

#### 3.3 Correctness of the Reduction to Arrays

Correctness of the ARS is given by four properties: finiteness of the models, compliance with the target SMT theories, termination of any reduction sequence, and soundness of the reductions. The correctness of these properties is sketched for the constants encoding in [13], with detailed proofs in [26]. Since we rely on the existing ARS and restrict our changes mainly to constraint generation, we have the same degree of overapproximation, and the correctness arguments made for the constants encoding are in large part valid for the arrays encoding. We present below the definition of a KerA<sup>+</sup> model and detail, for each property, how the use of arrays affects the correctness arguments and how they can be adjusted to remain valid.

Models. Every satisfiable KerA<sup>+</sup> formula has a model M = ⟨D, I⟩, where D is the model domain, consisting of a disjoint union of sets D<sub>1</sub>, ..., D<sub>n</sub>, with D<sub>i</sub>, 1 ≤ i ≤ n, containing the values for type τ<sub>i</sub>, and I is the model interpretation, consisting of assignments of domain values to KerA<sup>+</sup> constants. Models are used to access cell values, with the value of a KerA<sup>+</sup> expression e in model M being ⟦e⟧<sup>M</sup>. In s<sub>before</sub> ⇝ s<sub>after</sub>, we go from M<sub>before</sub> to M<sub>after</sub>, with M<sub>after</sub> containing the interpretation of additional constants and thus being an extension of M<sub>before</sub>.

Finiteness. This property states that every interpretation of a KerA<sup>+</sup> expression is defined only over finite values. Its proof is derived from the finiteness of the elements being modelled. In the arrays encoding, we potentially use arrays with infinite sorts, e.g., the integers, but all SMT interpretations that can be derived from such arrays are finite, since we encode only finite TLA<sup>+</sup> data structures. This guarantees finiteness of all KerA<sup>+</sup> models in the arrays encoding.

Theory Compliance. This property states that any sequence of states s<sub>0</sub> ⇝ ... ⇝ s<sub>n</sub> has the formulas Φ<sub>i</sub>, 1 ≤ i ≤ n, in the first-order logic fragment containing only quantifier-free expressions over uninterpreted functions and integer arithmetic. Its proof is by induction on the constraints generated. The constraint Φ<sub>0</sub> is always true and is thus trivially compliant. The inductive case is proved by showing that the constraints added by each rule are compliant. The rules in the arrays encoding only add array constraints, in addition to constraints supported by the constants encoding, so theory compliance is straightforward to guarantee.

Termination. This property states that every sequence of ARS reductions is finite, i.e., the reduction process always terminates. Its proof is based on ensuring that every rule r applied to a given state s<sub>before</sub> yields a state s<sub>after</sub> with e<sub>after</sub> being smaller than e<sub>before</sub>. An expression's length is given based on the length of its sub-expressions. The arrays encoding mainly changes constraint generation, and in the cases where rules are slightly modified they generate resulting expressions of the same size, thus guaranteeing termination.

Soundness. This property is described in Theorem 1. Both e and Φ are KerA<sup>+</sup> expressions, but Φ is in the first-order logic fragment supported by SMT solvers. Fundamentally, the ARS is rewriting a formula to forward it to the solver. The soundness proof consists of a case analysis of each reduction rule to establish that e<sub>before</sub> ∧ Φ<sub>before</sub> is equisatisfiable with e<sub>after</sub> ∧ Φ<sub>after</sub>, no matter the rule applied in s<sub>before</sub> ⇝ s<sub>after</sub>. The case analysis, which describes how e<sub>after</sub> and Φ<sub>after</sub> can be derived from e<sub>before</sub> and Φ<sub>before</sub> for each rule, relies on six invariants of the reduction system. Three invariants, 1, 3, and 4, are encoding independent, and thus are the same as in [13]; the remaining three, 2, 5, and 6, are changed due to the new representation of sets and functions. Below we show all six invariants, with the modifications needed to guarantee soundness for the arrays encoding.

Theorem 1. Let s<sub>0</sub> ⇝ ... ⇝ s<sub>n</sub> be a sequence of states produced by the ARS, with s<sub>i</sub> = e<sub>i</sub> | A<sub>i</sub> | ν<sub>i</sub> | Φ<sub>i</sub> and 1 ≤ i ≤ n. Assume that e<sub>0</sub> is a formula, i.e., it has type Bool. Then e<sub>0</sub> is satisfiable if the conjunction e<sub>n</sub> ∧ Φ<sub>n</sub> is satisfiable.

Invariant 1 (type correctness) In every reachable state e | A | ν | Φ of the ARS, the expression e is well typed.

Invariant 2 (arena membership) In every reachable state e | A | ν | Φ of the ARS, every cell c in either the expression e or the formula Φ is also in A.

Invariant 3 (model suitability) Let s<sub>before</sub> ⇝ s<sub>after</sub> be a reachable transition in the ARS, and M<sub>before</sub> be a suitable model for s<sub>before</sub>. A model M<sub>after</sub> extending M<sub>before</sub> is suitable for s<sub>after</sub>.

Invariant 4 (overapproximation) Let e | A | ν | Φ be a reachable state of the ARS, and M be its model. Assume that c<sub>set</sub> is a set cell in the arena A and that c<sub>set</sub>→c<sub>1</sub>, ..., c<sub>n</sub> are edges in A, for some n ≥ 0. Then, it holds that ⟦c<sub>set</sub>⟧<sup>M</sup> ⊆ {⟦c<sub>1</sub>⟧<sup>M</sup>, ..., ⟦c<sub>n</sub>⟧<sup>M</sup>}.

Invariant 5 (function domain) Let e | A | ν | Φ be a reachable state of the ARS. Assume that c<sub>f</sub> is a function cell of type $\mathbf{s}\_{\tau\_1} \rightarrow \mathbf{s}\_{\tau\_2}$ in the arena A. Then, there is a cell c<sub>dom</sub> of type $\mathbf{s}\_{\mathsf{Set}[\tau\_1]}$ such that c<sub>f</sub> $\xrightarrow{1}\_{A}$ c<sub>dom</sub>.

Invariant 6 (domain reduction) Let e | A | ν | Φ be a reachable state of the ARS, and M be its model. Assume that c<sub>f</sub> is a function cell and that c<sub>f</sub> $\xrightarrow{1}$ c<sub>F<sub>dom</sub></sub> is in the arena A. Then, it follows that ⟦c<sub>F<sub>dom</sub></sub>⟧<sup>M</sup> = ⟦DOMAIN f⟧<sup>M</sup>.

As described in Sections 3.1 and 3.2, arrays precisely model TLA<sup>+</sup> sets and functions. The handling of sets revolves around membership constraints of the form c<sub>set</sub>[c<sub>i</sub>], which can be set to true or false via store. Regarding functions, function application and update are trivially equivalent to array access and update. The more elaborate array operators also have counterparts in TLA<sup>+</sup>: constant arrays are equivalent to a function definition in which all range values are the same constant, and array map is equivalent to set map. These equivalences explain why the changes in the arrays encoding do not invalidate the case analysis of the reduction rules used to prove Theorem 1, thus guaranteeing soundness.

## 4 Evaluation

In order to evaluate the performance impact of the arrays-based encoding, we implemented it in the Apalache model checker, which currently supports the constants encoding. Given a TLA<sup>+</sup> specification containing a property P, Apalache is capable of performing bounded model checking up to a length k and, if P is an inductive invariant, it can check whether the property holds for unbounded lengths. In both modes, Apalache checks whether the SMT formula encoding the specification is satisfiable when conjoined with ¬P, and if so a counterexample (CEX) in the form of a trace is produced using the arena information and the satisfying assignment provided by the SMT solver. Our implementation adds new reduction rules to Apalache, which can be enabled via a CLI flag. When enabled, these rules replace the existing ones encoding sets and functions, as described in Section 3. In addition, we also extended Apalache's CEX generation to handle assignments to SMT formulas containing arrays. We use Z3 [7] as our back-end solver. Apalache is open-source and freely available<sup>3</sup>.

We performed a number of experiments using Apalache and the explicit-state model checker TLC. For Apalache, we evaluated both its existing constants encoding and two versions of the arrays encoding we propose, called arrays and funArrays. The arrays version encodes both TLA<sup>+</sup> sets and functions as arrays, while the funArrays version encodes only TLA<sup>+</sup> functions as arrays. The purpose of having two versions of our encoding is to evaluate separately the impact of encoding sets and functions as arrays. Our evaluation setup consisted of a machine with 64 AMD EPYC 7452 processors and 256 GB of memory. We first present the benchmarks used and then discuss the results obtained.

#### 4.1 Benchmarks

We consider the TLA<sup>+</sup> specifications of three asynchronous protocols as benchmarks. The first benchmark is a specification of the asynchronous Byzantine agreement protocol by Bracha and Toueg [5], shown in a simplified version in Figure 2, to which we refer as aba. The second benchmark is a specification of the consensus algorithm with Byzantine faults in one communication step by Dobre and Suri [9], to which we refer as cab. The third benchmark is a specification of the asynchronous non-blocking atomic commitment protocol by Guerraoui [12], to which we refer as nac. The common use of aba and cab is in replication scenarios with N = 3F + 1 replica nodes to tolerate F failures, while the nac protocol is typically used for partitioned databases. The specifications are available online<sup>4</sup>.

#### 4.2 Results

For each specification we check a variation of the agreement property. The results are shown in Figure 4. We can see that both arrays and funArrays scale in

<sup>3</sup> Available at https://github.com/informalsystems/apalache

<sup>4</sup> Available at https://github.com/informalsystems/apalache-bench

performance better than the constants encoding, with an order of magnitude improvement for some instances. It is also worth pointing out that arrays and funArrays were able to reach a result within the time limit in 29 and 28 instances, respectively, while the constants encoding was able to do so in only 20 instances. As for TLC, it performed worse than the three Apalache encodings in the nontrivial cases, reaching a result within the time limit in only 8 instances.

## 5 Related Work

An extensive discussion of work related to symbolic model checking for TLA<sup>+</sup> can be found in [13]. Here we focus exclusively on closely related publications. The IVy Prover [20] was designed to tackle verification of distributed algorithms with a decidable fragment of relational first-order logic. Some distributed algorithms, such as the one in Figure 2, cannot be directly expressed in this fragment, however, due to the use of power sets and set cardinalities. Recent efforts have focused on offering support for reasoning about set cardinalities [4], but limitations remain. Cut-off based techniques to automatically infer invariants of distributed algorithms in the IVy language, such as relational abstractions of Paxos and two-phase commit, have recently been proposed [10,11]. Similar benchmarks are used in [22] to infer generalized invariants from finite instances of TLA<sup>+</sup> and semi-automatically prove invariants with TLAPS. Specifications of fault-tolerant distributed algorithms encoded as threshold automata can be efficiently verified with ByMC [15,24]. The manual rewriting of an algorithm into threshold automata is, however, usually beyond the skills of a typical TLA<sup>+</sup> user. The work closest to ours involves the use of SMT arrays to encode Event-B and TLA<sup>+</sup> specifications in ProB [21]. ProB aims at handling infinite data structures, in contrast to our choice to work with bounded overapproximations. Reasoning about infinite domains implies the use of quantifiers, which prevents the use of the efficient decision procedures available for the decidable fragment of SMT, and this approach has been shown to underperform when compared against Apalache on the benchmarks from [13]. A last important point to mention is that CVC5 has its own non-standard SMT theory of sets [1]. This theory, however, cannot currently handle nested sets, which are a very commonly used TLA<sup>+</sup> construct. It remains a viable alternative to the SMT theory of arrays for the encoding of flat sets, but its use implies important restrictions on the input language and, consequently, on practical applicability.

## 6 Conclusions

We propose an encoding of the main TLA<sup>+</sup> constructs into the SMT theory of arrays, with the goal of providing the SMT solver with the structural information it needs to efficiently reach a solution. We implemented our encoding in the Apalache model checker, and our evaluation indicates that our arrays-based encoding provides a significant performance improvement when compared against Apalache's existing SMT encoding and the explicit-state model checker TLC.

Fig. 4: Time for checking agreement for aba, cab, and nac. Specifications were run in two configurations, one in which agreement is expected to hold (OK) and one in which it is not (NotOK). Instance size stands for the number of nodes used, and the time is given in seconds on a logarithmic scale; the timeout (TO) is 1 hour.

Encoding the remaining TLA<sup>+</sup> constructs in a structure-preserving way, be it via SMT arrays or algebraic datatypes, remains an interesting research avenue.

Acknowledgements Rodrigo Otoni and Natasha Sharygina's work was supported by the Swiss National Science Foundation, via grants 200021\_197353 and 200021\_185031, respectively. Igor Konnov and Jure Kukovec's work was supported by the Interchain Foundation. The authors thank Shon Feder for his kind assistance in preparing the evaluation infrastructure.

## References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## AutoHyper: Explicit-State Model Checking for HyperLTL

Raven Beutner(B) and Bernd Finkbeiner

CISPA Helmholtz Center for Information Security, Saarbrücken, Germany {raven.beutner,finkbeiner}@cispa.de

Abstract. HyperLTL is a temporal logic that can express hyperproperties, i.e., properties that relate multiple execution traces of a system. Such properties are becoming increasingly important and naturally occur, e.g., in information-flow control, robustness, mutation testing, path planning, and causality checking. Thus far, complete model checking tools for HyperLTL have been limited to alternation-free formulas, i.e., formulas that use only universal or only existential trace quantification. Properties involving quantifier alternations could only be handled in an incomplete way, i.e., the verification might fail even though the property holds. In this paper, we present AutoHyper, an explicit-state automata-based model checker that supports full HyperLTL and is complete for properties with arbitrary quantifier alternations. We show that language inclusion checks can be integrated into HyperLTL verification, which allows AutoHyper to benefit from a range of existing inclusion-checking tools. We evaluate AutoHyper on a broad set of benchmarks drawn from different areas in the literature and compare it with existing (incomplete) methods for HyperLTL verification.

## 1 Introduction

Hyperproperties [16] are system properties that relate multiple executions of a system. Such properties are of increasing importance as they naturally occur, e.g., in information-flow control [36], robustness [22], linearizability [30,31], path planning [39], mutation testing [27], and causality checking [18]. A prominent logic to express hyperproperties is HyperLTL, which extends linear-time temporal logic (LTL) with explicit trace quantification [15]. HyperLTL can, for instance, express generalized non-interference (GNI) [34], stating that the high-security input of a system does not influence the observable output.

$$\forall \pi. \forall \pi'. \exists \pi''. \Box \left(\bigwedge\_{a \in H} a\_{\pi} \leftrightarrow a\_{\pi''}\right) \land \Box \left(\bigwedge\_{a \in L \cup O} a\_{\pi'} \leftrightarrow a\_{\pi''}\right) \tag{GNI}$$

Here, H is a set of high-security inputs, L is a set of low-security inputs, and O is a set of low-security outputs. The formula states that for any traces π, π′ there exists a third trace π′′ that agrees with the high-security inputs of π and with the low-security inputs and outputs of π′. Any observation made by a low-security attacker is thus compatible with every possible high-security input.
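To make the quantifier structure concrete, GNI can be checked by brute force over a finite set of finite traces. The following Python sketch is our own toy encoding (the proposition sets H, L_in, O and all names are hypothetical); real HyperLTL model checking operates on infinite traces via automata, so this only illustrates the shape of the formula.

```python
from itertools import product

# Hypothetical proposition sets for illustration only.
H = {"h"}       # high-security inputs
L_in = {"l"}    # low-security inputs
O = {"o"}       # low-security outputs

# A trace is a tuple of letters; a letter is a frozenset of atomic propositions.
def agrees(t1, t2, props):
    """The Box-conjunction: t1 and t2 agree on `props` at every position."""
    return all((a in s1) == (a in s2) for s1, s2 in zip(t1, t2) for a in props)

def gni(traces):
    """forall pi, pi'. exists pi'': agrees with pi on H and with pi' on L u O."""
    return all(
        any(agrees(t3, t1, H) and agrees(t3, t2, L_in | O) for t3 in traces)
        for t1, t2 in product(traces, repeat=2)
    )
```

For instance, a system whose output o copies the low input l passes this check, while one whose output copies the high input h does not.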

We are interested in the model checking (MC) problem of HyperLTL, i.e., whether a given (finite-state) system satisfies a given property. For HyperLTL, the structure of the quantifier prefix directly impacts the complexity of this problem. For alternation-free formulas (i.e., formulas that only use quantifiers of a single type), verification is well understood and is reducible to the verification of a trace property on a self-composition of the system [3]. This reduction has, for example, been implemented in MCHyper [29], a tool that can model check (alternation-free) HyperLTL formulas in systems of considerable size (circuits with thousands of latches).

Verification is much more challenging for properties involving quantifier alternations (such as GNI from above). While MC algorithms supporting full HyperLTL exist (see [15,29]), they have not been implemented yet. Instead, over the years, a number of approaches to the verification of such properties in practice have been made: Finkbeiner et al. [29] and D'Argenio et al. [22] manually strengthen properties with quantifier alternation into properties that are alternation-free and can be checked by MCHyper. Coenen et al. [19] instantiate existential quantification in a ∀<sup>∗</sup>∃<sup>∗</sup> property (i.e., a property involving an arbitrary number of universal quantifiers followed by an arbitrary number of existential quantifiers, such as GNI) with an explicit (user-provided) strategy, thus reducing to the verification of an alternation-free formula. Alternatively, the strategy that resolves existential quantification can be automatically synthesized [7]. Hsu et al. [31] present a bounded model checking (BMC) approach for HyperLTL that is implemented in HyperQube. See Section 4 for more details.

While all these verification tools can verify (or refute) interesting properties, they all suffer from the same fundamental limitation: they are incomplete. That is, for all the tools above, we can come up with verification instances where they fail, not because of resource constraints but because of inherent limitations in the underlying verification algorithm. Moreover, such instances are not rare events but are encountered regularly in practice. For example, many of the benchmarks used to evaluate HyperQube (by Hsu et al. [31]) do not admit a strategy to resolve existential quantification. Conversely, many of the properties verified by Coenen et al. [19] (such as GNI) cannot be verified using BMC [31].

AutoHyper. In this paper, we present AutoHyper, a model checker for HyperLTL. Our tool checks a hyperproperty by iteratively eliminating trace quantification using automata complementations, thereby reducing verification to the emptiness check of an automaton [29]. Importantly – and different from previous tools for HyperLTL verification such as MCHyper [29,19] and HyperQube [31] – AutoHyper can cope with (and is complete for) arbitrary HyperLTL formulas. Model checking using AutoHyper does not require manual effort (such as writing an explicit strategy in MCHyper [19]), nor does a user need to worry whether the given property can even be verified with a given method. AutoHyper thus provides a "push-button" model checking experience for HyperLTL.<sup>1</sup>

<sup>1</sup> The name of AutoHyper is derived from the fact that it is both Automata-based and Automatic (i.e., it is complete and does not require any user intervention).

To improve AutoHyper's efficiency, we make the (theoretical) observation that we can often avoid explicit automaton complementation and instead reduce to a language inclusion check on Büchi automata (cf. Proposition 1). On the practical side, this enables AutoHyper to resort to a range of mature language inclusion checkers, including spot [26], RABIT [17], BAIT [25], and FORKLIFT [24].

Evaluation. Using AutoHyper, we extensively study the practical aspects of model checking HyperLTL properties with quantifier alternations. To evaluate the performance of explicit-state model checking, we apply AutoHyper to a broad range of benchmarks taken from the literature and compare it with existing (incomplete) tools. We make the surprising observation that – at least on the currently available benchmarks – explicit-state MC as implemented in AutoHyper performs on par with (and frequently outperforms) symbolic methods such as BMC [31]. Our benchmarks stem from various areas within computer science, so AutoHyper should – thanks to its "push-button" functionality, completeness, and ease of use – be a valuable addition to many areas.

Apart from using AutoHyper as a practical MC tool, we can also use it as a complete baseline to systematically evaluate existing (incomplete) methods. For example, while it is known that replacing existential quantification with a strategy (as done by Coenen et al. [19]) is incomplete, it was, thus far, unknown if this incompleteness occurs frequently or is merely a rare phenomenon. We use AutoHyper to obtain a ground truth and evaluate the strategy-based verification approach in terms of its effectiveness (i.e., how many instances it can verify despite being incomplete) and efficiency.

Structure. The remainder of this paper is structured as follows. In Section 2, we introduce HyperLTL. We recap automata-based verification (which we abbreviate ABV) and our new approach utilizing language inclusion checks in Section 3. We discuss alternative verification approaches for HyperLTL in Section 4. In Section 6, we compare different backend solving techniques and study the complexity of HyperLTL MC with multiple quantifier alternations in practice. In Section 7, we evaluate ABV on a set of benchmarks from the literature and compare with the bounded model checker HyperQube [31]. In Section 8, we use AutoHyper for a detailed analysis of (and comparison with) strategy-based verification [19,7].

## 2 Preliminaries

We fix a set of atomic propositions AP and define Σ := 2<sup>AP</sup>. HyperLTL [15] extends LTL with explicit quantification over traces, thereby lifting it from a logic expressing trace properties to one expressing hyperproperties [16]. Let V be a set of trace variables. We define HyperLTL formulas by the following grammar:

$$\begin{aligned} \psi &:= a\_{\pi} \mid \neg \psi \mid \psi \land \psi \mid \mathsf{O}\,\psi \mid \psi \mathcal{U}\,\psi \\ \varphi &:= \exists \pi.\varphi \mid \forall \pi.\varphi \mid \psi \end{aligned}$$

where π ∈ V and a ∈ AP.
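This grammar can be represented directly as an algebraic datatype. The sketch below is our own encoding (all class and function names are ours, not from the paper); free_vars computes the trace variables used but not bound by a quantifier, so a formula is closed exactly when it returns the empty set.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Atom:          # a_pi : proposition a on the trace bound to pi
    a: str
    pi: str

@dataclass(frozen=True)
class Neg:
    sub: "object"

@dataclass(frozen=True)
class And:
    left: "object"
    right: "object"

@dataclass(frozen=True)
class Next:          # the O operator
    sub: "object"

@dataclass(frozen=True)
class Until:
    left: "object"
    right: "object"

@dataclass(frozen=True)
class Exists:        # exists pi. phi
    pi: str
    sub: "object"

@dataclass(frozen=True)
class Forall:        # forall pi. phi
    pi: str
    sub: "object"

def free_vars(phi, bound=frozenset()):
    """Trace variables used in the body but not bound by a quantifier."""
    if isinstance(phi, Atom):
        return set() if phi.pi in bound else {phi.pi}
    if isinstance(phi, (Exists, Forall)):
        return free_vars(phi.sub, bound | {phi.pi})
    if isinstance(phi, (Neg, Next)):
        return free_vars(phi.sub, bound)
    return free_vars(phi.left, bound) | free_vars(phi.right, bound)
```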

We assume that the formula is closed, i.e., all trace variables that are used in the body are bound by some quantifier. The semantics of HyperLTL is given with respect to a trace assignment Π : V ⇀ Σ<sup>ω</sup> mapping trace variables to traces. For π ∈ V and t ∈ Σ<sup>ω</sup>, we write Π[π ↦ t] for the trace assignment obtained by updating the value of π to t. Given a set of traces T ⊆ Σ<sup>ω</sup>, a trace assignment Π, and i ∈ N, we define:


A transition system is a tuple T = (S, S<sub>0</sub>, κ, L) where S is a set of states, S<sub>0</sub> ⊆ S is a set of initial states, κ ⊆ S × S is a transition relation, and L : S → Σ is a labeling function. We write s →<sub>T</sub> s′ whenever (s, s′) ∈ κ. A path is an infinite sequence s<sub>0</sub>s<sub>1</sub>s<sub>2</sub> · · · ∈ S<sup>ω</sup> such that s<sub>0</sub> ∈ S<sub>0</sub> and s<sub>i</sub> →<sub>T</sub> s<sub>i+1</sub> for all i. The associated trace is given by L(s<sub>0</sub>)L(s<sub>1</sub>)L(s<sub>2</sub>) · · · ∈ Σ<sup>ω</sup>. We write Traces(T) ⊆ Σ<sup>ω</sup> for the set of all traces generated by T. We say T satisfies a HyperLTL property φ, written T ⊨ φ, if ∅ ⊨<sub>Traces(T)</sub> φ, where ∅ denotes the empty trace assignment.
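As a small illustration of these definitions (our own encoding: S and S<sub>0</sub> as sets, κ as a successor dictionary, L as a dictionary), the depth-bounded prefixes of Traces(T) can be enumerated as follows; the infinite traces themselves can, of course, only be approximated this way.

```python
def trace_prefixes(system, depth):
    """Enumerate all length-`depth` prefixes of traces of the system.

    `system` is a tuple (S, S0, kappa, labels), where kappa maps each state
    to its set of successors and labels maps each state to its letter L(s).
    """
    S, S0, kappa, labels = system
    out = set()
    stack = [(s, (labels[s],)) for s in S0]   # a path starts in S0
    while stack:
        s, pre = stack.pop()
        if len(pre) == depth:
            out.add(pre)
        else:
            for s2 in kappa.get(s, ()):       # extend along s -> s2
                stack.append((s2, pre + (labels[s2],)))
    return out
```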

## 3 Automata-based HyperLTL Model Checking

Given a system T and HyperLTL property φ, we want to decide whether T |= φ. In this section, we recap the automata-based approach to the model checking of HyperLTL [29]. We further show how language inclusion checks can be incorporated into the model checking procedure to make use of a broad collection of mature language inclusion checkers.

### 3.1 Automata-based Verification

The idea of automata-based verification (ABV) [29] is to iteratively eliminate quantifiers and thus reduce MC to the emptiness check of an automaton. A non-deterministic Büchi automaton (NBA) is a tuple A = (Q, Q<sub>0</sub>, δ, F) where Q is a finite set of states, Q<sub>0</sub> ⊆ Q is a set of initial states, δ : Q × Σ → 2<sup>Q</sup> is a transition function, and F ⊆ Q is a set of accepting states. We write L(A) ⊆ Σ<sup>ω</sup> for the language of A, i.e., all infinite words that have a run that visits states in F infinitely many times (see, e.g., [2]). For traces t<sub>1</sub>, . . . , t<sub>n</sub> ∈ Σ<sup>ω</sup>, we write zip(t<sub>1</sub>, . . . , t<sub>n</sub>) ∈ (Σ<sup>n</sup>)<sup>ω</sup> for the pointwise product, i.e., zip(t<sub>1</sub>, . . . , t<sub>n</sub>)(i) := (t<sub>1</sub>(i), . . . , t<sub>n</sub>(i)).

Let T = (S, S<sub>0</sub>, κ, L) be a fixed transition system and let φ̇ be some fixed closed HyperLTL formula (we use the dot to refer to the original formula and use φ, φ′ to refer to subformulas of φ̇). For some subformula φ that contains free trace variables π<sub>1</sub>, . . . , π<sub>n</sub>, we say an NBA A over Σ<sup>n</sup> is T-equivalent to φ if for all traces t<sub>1</sub>, . . . , t<sub>n</sub> it holds that [π<sub>1</sub> ↦ t<sub>1</sub>, . . . , π<sub>n</sub> ↦ t<sub>n</sub>] ⊨<sub>Traces(T)</sub> φ if and only if zip(t<sub>1</sub>, . . . , t<sub>n</sub>) ∈ L(A). That is, A accepts exactly the zippings of traces that constitute a satisfying trace assignment for φ.

To check if T ⊨ φ̇, we inductively construct an automaton A<sub>φ</sub> that is T-equivalent to φ for each subformula φ of φ̇. For the (quantifier-free) LTL body of φ̇, we can construct this automaton via a standard LTL-to-NBA construction [29,2]. Now consider some subformula φ′ = ∃π.φ where φ′ contains free trace variables π<sub>1</sub>, . . . , π<sub>n</sub> and so φ contains free trace variables π<sub>1</sub>, . . . , π<sub>n</sub>, π. We are given an inductively constructed NBA A<sub>φ</sub> = (Q, Q<sub>0</sub>, δ, F) over Σ<sup>n+1</sup> that is T-equivalent to φ. We define the automaton A<sub>φ′</sub> over Σ<sup>n</sup> as A<sub>φ′</sub> := (S × Q, S<sub>0</sub> × Q<sub>0</sub>, δ′, S × F) where δ′ is defined as

$$\delta'\left((s,q),(\sigma\_1,\ldots,\sigma\_n)\right) := \left\{ (s',q') \mid s \xrightarrow{\mathcal{T}} s' \land q' \in \delta\left(q,(\sigma\_1,\ldots,\sigma\_n,L(s))\right) \right\}.$$

Informally, A<sub>φ′</sub> reads the zippings of traces t<sub>1</sub>, . . . , t<sub>n</sub> and guesses a trace t ∈ Traces(T) such that zip(t<sub>1</sub>, . . . , t<sub>n</sub>, t) ∈ L(A<sub>φ</sub>). It is easy to see that A<sub>φ′</sub> is T-equivalent to φ′. To handle universal trace quantification, we consider a formula φ′ = ∀π.φ as φ′ = ¬∃π.¬φ and combine the construction for existential quantification with an automaton complementation.
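On explicit finite representations, this construction can be sketched as follows. This is our own toy encoding, not AutoHyper's internal data structures: the NBA's transition function is a dictionary from (state, letter) pairs to successor sets, and the letter passed to δ is obtained by appending L(s) to the letter being read.

```python
def eliminate_exists(system, nba):
    """Build the product automaton for one existential trace quantifier.

    system: (S, S0, kappa, labels) with kappa a successor dictionary.
    nba:    (Q, Q0, delta, F) over letters of arity n+1, with delta a
            dictionary from (q, letter_tuple) to sets of successors.
    Returns an NBA over letters of arity n, given as a transition *function*.
    """
    S, S0, kappa, labels = system
    Q, Q0, delta, F = nba

    def delta2(state, letter):
        (s, q) = state
        succs = set()
        for s2 in kappa.get(s, ()):           # guess a step s -> s2 in T
            ext = letter + (labels[s],)       # append L(s) as the last component
            for q2 in delta.get((q, ext), ()):
                succs.add((s2, q2))
        return succs

    states = {(s, q) for s in S for q in Q}   # S x Q
    init = {(s, q) for s in S0 for q in Q0}   # S0 x Q0
    acc = {(s, q) for s in S for q in F}      # S x F
    return states, init, delta2, acc
```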

Following the inductive construction, we obtain an automaton A<sub>φ̇</sub> over the singleton alphabet Σ<sup>0</sup> that is T-equivalent to φ̇. By definition of T-equivalence, T ⊨ φ̇ if and only if ∅ ⊨<sub>Traces(T)</sub> φ̇, which holds if and only if A<sub>φ̇</sub> is non-empty (which we can decide [21]).

#### 3.2 HyperLTL Model Checking by Language Inclusion

The algorithm outlined above requires one complementation for each quantifier alternation in the HyperLTL formula. While we cannot avoid the theoretical cost of this complementation (see [36,15]), we can reduce to a problem that is, in practice, more tamable: language inclusion.

For a system T and a natural number n ∈ N, we define A<sup>n</sup><sub>T</sub> as an NBA over Σ<sup>n</sup> such that for any traces t<sub>1</sub>, . . . , t<sub>n</sub> ∈ Σ<sup>ω</sup> we have zip(t<sub>1</sub>, . . . , t<sub>n</sub>) ∈ L(A<sup>n</sup><sub>T</sub>) if and only if t<sub>i</sub> ∈ Traces(T) for every 1 ≤ i ≤ n. We can construct A<sup>n</sup><sub>T</sub> by building the n-fold self-composition of T [3] and converting it to an automaton by moving the labels from states to edges and marking all states as accepting. We can now state a formal connection between language inclusion and HyperLTL MC (a proof can be found in the full version [9]):

Proposition 1. Let φ̇ = ∀π<sub>1</sub>. . . . ∀π<sub>n</sub>.φ be a HyperLTL formula (where φ may contain additional trace quantifiers) and let A<sub>φ</sub> be an automaton over Σ<sup>n</sup> that is T-equivalent to φ. Then T ⊨ φ̇ if and only if L(A<sup>n</sup><sub>T</sub>) ⊆ L(A<sub>φ</sub>).
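The automaton A<sup>n</sup><sub>T</sub> used in Proposition 1 can be built directly on explicit representations. The sketch below follows the construction described above (n-fold self-composition, labels moved to edges, all states accepting); the encoding is our own and is not AutoHyper's implementation.

```python
from itertools import product

def self_composition(system, n):
    """Build A^n_T: an NBA over Sigma^n accepting exactly the zippings
    zip(t1, ..., tn) with each ti in Traces(T).

    system: (S, S0, kappa, labels) with kappa a successor dictionary.
    Returns (Q, Q0, delta, F) with delta a dictionary from (state, letter)
    pairs to sets of successors, and every state accepting.
    """
    S, S0, kappa, labels = system
    Q = set(product(S, repeat=n))             # n-tuples of system states
    Q0 = set(product(S0, repeat=n))
    F = Q                                     # all states accepting
    delta = {}
    for q in Q:
        letter = tuple(labels[s] for s in q)  # edge reads (L(s1), ..., L(sn))
        succs = set(product(*(kappa.get(s, ()) for s in q)))
        delta[(q, letter)] = succs
    return Q, Q0, delta, F
```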

We can use Proposition 1 to avoid a complementation for the outermost quantifier alternation. For example, assume φ̇ = ∀π<sub>1</sub>.∀π<sub>2</sub>.∃π<sub>3</sub>.ψ where ψ is quantifier-free. Using the construction from Section 3.1, we obtain an automaton A<sub>∃π3.ψ</sub> that is T-equivalent to ∃π<sub>3</sub>.ψ (we can construct A<sub>∃π3.ψ</sub> in time linear in the size of T). By Proposition 1, we then have T ⊨ φ̇ if and only if L(A<sup>2</sup><sub>T</sub>) ⊆ L(A<sub>∃π3.ψ</sub>).

Note that complementation and a subsequent emptiness check is a theoretically optimal method to solve the (PSPACE-complete) language inclusion problem. Proposition 1 thus offers no asymptotic advantage over the "standard" ABV from Section 3.1. In practice, however, constructing an explicit complemented automaton is often unnecessary, as language inclusion or non-inclusion might be witnessed without a complete complementation [26,25,17,24]. This makes Proposition 1 relevant for the present work and the performance of AutoHyper.

## 4 Related Work and HyperLTL Verifcation Approaches

HyperLTL [15] is the most studied logic for expressing hyperproperties. A range of problems from different areas in computer science can be expressed as HyperLTL MC problems, including (optimal) path planning [39], mutation testing [27], linearizability [31], robustness [22], information-flow control [36], and causality checking [18], to name only a few. Consequently, any model checking tool for HyperLTL is applicable to many disciplines within computer science and provides a unified solution to many challenging algorithmic problems. In recent years, different (mostly incomplete) methods for the verification of HyperLTL have been developed. We discuss them below (see the full version [9] for details).

Automata-based Model Checking. Finkbeiner et al. [29] introduce the automata-based model checking approach as presented in Section 3.1. For alternation-free formulas, the algorithm corresponds to the construction of the self-composition of a system [3] and is implemented in the MCHyper tool [29]. MCHyper can handle systems of significant size (well beyond the reach of explicit-state methods) but is unable to handle any quantifier alternation (the main motivation for AutoHyper). htltl2mc [15] is a prototype model checker for HyperLTL<sub>2</sub> (a fragment of HyperLTL with at most one alternation) built on top of GOAL [38]. In contrast to htltl2mc, AutoHyper supports properties with arbitrarily many quantifier alternations and features automata with symbolic alphabets – which is important to handle large systems with many atomic propositions, cf. Footnote 7.

Strategy-based Verification. Coenen et al. [19] verify ∀<sup>∗</sup>∃<sup>∗</sup> properties by instantiating existential quantification with an explicit strategy. This method – which we refer to as strategy-based verification (SBV) – comes in two flavors: either the strategy is provided by the user or it is synthesized automatically. In the former case, model checking reduces to checking an alternation-free formula and can thus handle large systems, but it requires significant user effort (and is thus no "push-button" technique). In the latter case, the method works fully automatically [8,7] but requires an expensive strategy synthesis. SBV is incomplete, as the strategy resolving existentially quantified traces only observes finite prefixes of the universally quantified traces. While SBV can be made complete by adding prophecy variables [7], the automatic synthesis of such prophecies is currently limited to very small systems and to properties that are temporally safe [5]. We investigate both the performance and the incompleteness of SBV in Section 8.

Bounded Model Checking. Hsu et al. [31] propose a bounded model checking (BMC) procedure for HyperLTL. Similar to BMC for trace properties [11], the system is unfolded up to a fixed depth, and pending obligations beyond that depth are either treated pessimistically (to show the satisfaction of a formula) or optimistically (to show the violation of a formula). While BMC for trace properties reduces to SAT solving, BMC for hyperproperties naturally reduces to QBF solving. As usual for bounded methods, BMC for HyperLTL is incomplete. For example, it can never show that a system satisfies a hyperproperty whose LTL body contains an invariant (as is, e.g., the case for GNI).<sup>2</sup> We compare AutoHyper and BMC (in the form of HyperQube [31]) in Section 7.

## 5 AutoHyper: Tool Overview

AutoHyper is written in F# and implements the automata-based verification approach described in Section 3.1 and, if desired by the user, makes use of the language-inclusion-based reduction from Section 3.2. AutoHyper uses spot [26] for LTL-to-NBA translations and automata complementations. To check language inclusion, AutoHyper uses spot (which is based on determinization), RABIT [17] (which is based on a Ramsey-based approach with heavy use of simulations), BAIT [25], and FORKLIFT [24] (both based on well-quasiorders). AutoHyper is designed such that communication with external automata tools is done via established text-based formats (as opposed to proprietary APIs), namely the HANOI [1] and BA automaton formats. New (or updated) tools that improve on fundamental automata operations, such as complementation and inclusion checks, can thus be integrated easily. Internally, we represent automata using symbolic alphabets (similar to spot). We store transition formulas as DNFs, as this allows for very efficient SAT checks, which are needed during the product construction.

All experiments in this paper were conducted on a Mac Mini with an Intel Core i3 (i3-8100B) and 16GB of memory. We used spot version 2.11.1; RABIT version 2.4.5; BAIT commit 369e1a4; and FORKLIFT commit 5d519e3.

Input Formats. AutoHyper supports both explicit-state systems (given in a HANOI-like [1] input format) and symbolic systems that are internally converted

<sup>2</sup> BMC for trace properties can be made complete by using bounds on the unrolling depth (also called completeness thresholds) [14] and including loop conditions in the encoding [11]. As remarked by Hsu et al. [31], the same is much more challenging for hyperproperties, and no solutions have been proposed. Instead, Hsu et al. [31] propose an alternative unrolling semantics (which they call halting semantics) that can mitigate this incompleteness issue for programs that terminate after a fixed number of steps. This is a strong (and often unrealistic) assumption for general reactive systems.

to an explicit-state representation. The support for symbolic systems includes Aiger circuits, symbolic models written in a fragment of the NuSMV input language [13], and a simple boolean programming language [6].

Random Benchmarks. For our evaluation, we use both existing instances from various sources in the literature and randomly generated problems.<sup>3</sup> We generate random transition systems based on the Erdős–Rényi–Gilbert model [28]. Given a size n and a density parameter p ∈ [0, 1], we generate a graph with n states, where for every two states s, s′, there is a transition s → s′ with probability p. To generate a graph with, in expectation, a constant outdegree of k, we can choose p = k/n. We further ensure that the system is connected and that all states have at least one outgoing edge. We generate random HyperLTL formulas (with a given quantifier prefix) by sampling the LTL matrix using spot's randltl.
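The generation described above can be sketched as follows. The code and all parameter names are ours: we add each edge with probability p = k/n and then patch states without a successor; for brevity, the sketch omits the connectivity check that the paper additionally ensures.

```python
import random

def random_system(n, k, seed=None):
    """Random successor relation in the Erdos-Renyi-Gilbert style:
    each of the n*n possible edges is present with probability p = k/n,
    giving an expected outdegree of k; states left without a successor
    get one random outgoing edge."""
    rng = random.Random(seed)
    p = k / n
    kappa = {s: {t for t in range(n) if rng.random() < p} for s in range(n)}
    for s in range(n):                    # every state needs a successor
        if not kappa[s]:
            kappa[s].add(rng.randrange(n))
    return kappa
```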

## 6 HyperLTL Model Checking Complexity in Practice

Before we turn our attention to benchmarks found in the literature, we compare the different backend inclusion checkers supported by AutoHyper by evaluating them on a large set of synthetic (random) benchmarks (in Section 6.1). Moreover, the random generation of benchmarks allows us to peek at formulas with more than one quantifier alternation. The theoretical hardness of model checking properties with multiple alternations has been studied extensively [15,36], and we analyze, for the first time, how these results transfer to practice (in Section 6.2).

### 6.1 Performance of Inclusion Checkers

As the first set of benchmarks, we compare the different backend inclusion checkers supported by AutoHyper. In Figure 1, we depict how many instances can be solved using the inclusion checks of spot, BAIT, RABIT, and FORKLIFT within a timeout of 10s and give the median running time on the instances that could be solved within the timeout. We observe that spot clearly outperforms RABIT, BAIT, and FORKLIFT in terms of the percentage of instances that can be checked within 10s.<sup>4</sup> While, in general, spot solves the most instances, a manual inspection reveals that there are also instances that can only be solved by RABIT

<sup>3</sup> The advantage of randomly generated instances is twofold. First, it allows for the easy generation of a large set of benchmarks. Second, the random generation is parameterized by multiple parameters (such as system size, transition density, formula size, etc.), enabling a comprehensive analysis of the exact impact of different parameters on the model checking complexity in practice.

<sup>4</sup> We remark that spot operates on automata with a symbolic alphabet (i.e., transitions are defined as boolean formulas over AP). In contrast, RABIT, BAIT, and FORKLIFT only support explicit alphabets (i.e., automata with one symbol for each element of 2<sup>AP</sup>).

Fig. 1: We evaluate different backend solvers on instances of varying system size with an (on average) constant outdegree of 10 and a fixed property size of 20. We generate 20 samples per system size. We display both the success rate of each solver within a timeout of 10s (on the left axis) and the median running time on the solved instances (on the right axis).

or BAIT/FORKLIFT. This justifies why AutoHyper supports multiple backend inclusion checkers that implement different algorithms and thus excel on different problems (we will confirm this in Section 7). Moreover, our experiments provide evidence that HyperLTL MC is a natural source of challenging language inclusion benchmarks (see the full version [9]).

We remark that we set the timeout of 10s deliberately low to compute (and reproduce) the plots in a reasonable time (computing Figure 1 took about 3.5h). If a user wants to verify a given instance and does not require a result within a few seconds, running the solver for even longer will likely increase the success rate further (see also the evaluation in Section 7).

### 6.2 Model Checking Beyond ∀<sup>∗</sup>∃<sup>∗</sup>

Using randomly generated benchmarks, we can also peek at the practical complexity of model checking in the presence of multiple quantifier alternations. In theory, the model checking complexity of HyperLTL increases by one exponent with each quantifier alternation [15,36]. Using AutoHyper, we can, for the first time, investigate the model checking complexity in practice.

Fig. 2: For properties with a varying number of quantifier alternations, we display the average time spent on automata complementation during model checking.


We model check randomly generated formulas with 1 to 4 quantifier alternations and visualize the total running time based on the cost of each complementation (using spot) in Figure 2 (recall that checking a formula with k alternations

Table 1: We depict the running time of AutoHyper when verifying GNI on the boolean programs taken from [6] and [10]. We give the program, the bitwidth (bw), the size of the intermediate explicit-state representation (Size), and the time taken by each solver. The timeout is set to 60s and indicated by a "-". The property holds in all cases. Times are given in seconds.


using ABV requires k automaton complementations). Although the number of quantifier alternations has an undeniable impact on the total running time (the cumulative height of each bar), the increase in runtime is not proportional to the (non-elementary) increase suggested by the theoretical analysis. Different from the theoretical analysis (where the (k + 1)th complementation is exponentially more expensive than the kth), the cost of each complementation barely increases (or even decreases). This suggests that the T-equivalent automata constructed in each iteration are, in practice, much smaller than indicated by the worst-case theoretical analysis. Verification of properties beyond one alternation is thus less infeasible than the theory suggests (at least on randomly generated test cases).

## 7 Evaluation on Symbolic Systems

In this section, we challenge AutoHyper with complex model checking problems found in the literature. Our benchmarks stem from a range of sources, including non-interference in boolean programs [6], symmetry in mutual exclusion algorithms [19], non-interference in multi-threaded programs [37], fairness in non-repudiation protocols [32], mutation testing [27], and path planning [39].

### 7.1 Model Checking GNI on Boolean Programs

We use AutoHyper to verify GNI on a range of boolean programs that process high-security and low-security inputs (taken from [6,10]). Table 1 depicts the runtime results using different backend solvers. We test each program with varying bitwidths and depict the largest bitwidth that can be solved by at least one solver (within a timeout of 60s). We, again, note that spot performs better than



other inclusion checkers and, in particular, scales better when the size of the system increases. Note that the number of atomic propositions is 3 in all instances, so spot's support for symbolic alphabets has a negligible impact on the running time. We emphasize that not all instances in Table 1 can be verified using SBV [19,7] without a user-provided fixed lookahead. Likewise, BMC [31] can never verify GNI. This provides further evidence of why complete model checking tools (of which AutoHyper is the first) are necessary.

#### 7.2 Explicit Model Checking of Symbolic Systems

In this section, we evaluate AutoHyper on challenging symbolic models (NuSMV models [13]) that were used by Hsu et al. [31] to evaluate HyperQube.

The properties we verify cover a wide range. For example, we verify that Lamport's bakery algorithm [33] does not satisfy various symmetry properties (as the algorithm prioritizes processes with a lower ticket ID); we check linearizability<sup>5</sup> [30] on the SNARK data structure [23] and identify a previously known bug; and we generate model-based mutation test cases using the approach proposed by Fellner et al. [27]. Further details on the benchmarks are provided in [31].

We check each instance using both HyperQube and AutoHyper and depict the results in Table 2.<sup>6</sup> When using AutoHyper, we always apply spot's inclusion checker.<sup>7</sup> For HyperQube, we use the unrolling semantics and unrolling depth listed in [31, Table 2]. We observe that for most instances – despite using explicit-state methods and thus being complete (cf. Section 7.4) – AutoHyper performs on par with HyperQube. On instances using Lamport's bakery algorithm, BMC only needs to unroll to very shallow depths, resulting in very efficient solving, whereas AutoHyper's running time is dominated by spot's LTL-to-NBA translation (consuming up to 98% of the total time). Conversely, on the large SNARK example, AutoHyper performs significantly better.

#### 7.3 Hyperproperties for Path Planning

As a last set of benchmarks, we use planning problems for robots encoded into HyperLTL, as proposed by Wang et al. [39]. For example, the synthesis of a shortest path can be phrased as a ∃∀ property stating that there exists a path to the goal such that all alternative paths to the goal take at least as long. Wang et al. [39] propose to check the resulting HyperLTL property by encoding it in first-order logic, which is then solved by an SMT solver. While not competitive with state-of-the-art planning tools, HyperLTL allows one to express a broad range of problems (shortest path, path robustness, etc.) in a very general way. Hsu et al. [31] observe that the QBF encoding implemented in HyperQube outperforms the SMT-based approach by Wang et al. [39]. In this section, we evaluate AutoHyper on these planning hyperproperties and compare it with HyperQube.<sup>8</sup>

We depict the results in Table 3. It is evident that AutoHyper outperforms HyperQube, sometimes by orders of magnitude. This is surprising, as planning problems (which are essentially reachability problems) on symbolic systems should be advantageous for symbolic methods such as BMC. The large size of the in-

<sup>5</sup> Linearizability asserts that any execution of a concurrent data structure corresponds to a sequential execution, which is naturally expressed as a ∀∃ hyperproperty.

<sup>6</sup> For the two verifcation instances (Bakery3,φS3) and (NRP : Tincorrect, φfair ) HyperQube provides the wrong verifcation result. We mark such instances with a "!" to avoid confusion when comparing Table 2 with [31, Table 2]. In particular, the supposedly unfair version of the NRP protocol is, in fact, fair.

<sup>7</sup> The automata use a symbolic alphabet with up to 18 letters. A conversion to an explicit alphabet – as required for RABIT, BAIT, and FORKLIFT – is thus infeasible (this would require 2 <sup>18</sup> symbols).

<sup>8</sup> AutoHyper is intended as a model checking tool, i.e., it only checks if a property holds or is violated. However, as we show in the full version [9], we could use the counterexamples returned by the inclusion checker to synthesize an actual plan.

Table 3: We evaluate HyperQube and AutoHyper on hyperproperties that encode the existence of a shortest path (φsp) and robust path (φrp). We give the specifcation (Spec), the size of the grid (Grid), and the times taken by HyperQube and AutoHyper (t). For HyperQube, we additionally give the unrolling depth used (k) and the fle size of the QBF generated (|QBF|). For AutoHyper, we additionally give the size of the generated explicit state space (Size). Times are given in seconds. The timeout is set to 20 min and indicated by a "-".


termediate QBF indicates that a more optimized encoding (perhaps specifc to path planning) could improve the performance of BMC on such examples.

#### 7.4 Bounded vs. Explicit-State Model Checking

Bounded model checking has seen remarkable success in the verification of trace properties and frequently scales to systems whose size is well out of scope for explicit-state methods [20]. Similarly, in the context of alternation-free hyperproperties, symbolic verification tools such as MCHyper [29] (which internally reduces to the verification of a circuit using ABC [12]) can verify systems that are well beyond the reach of explicit-state methods. In contrast, in the context of model checking for hyperproperties that involve quantifier alternations, our findings make a strong case for the use of explicit-state methods (as implemented in AutoHyper):

First, compared to symbolic methods (such as BMC), explicit-state model checking is currently the only method that is complete. While BMC was able to verify or refute all properties in Tables 2 and 3, many instances cannot be solved with the current BMC encoding. As a concrete example, BMC can never verify formulas whose body contains simple invariants (such as GNI) and can thus not verify any of the instances in Table 1. The most significant advantage of explicit-state model checking (as implemented in AutoHyper) is thus that it is both push-button and complete, i.e., it can – at least in theory – verify or refute all properties.

Second, the performance of AutoHyper seems to be on par with that of BMC and frequently outperforms it (even by several orders of magnitude, cf. Table 3). We stress that this is despite the fact that for the evaluation of HyperQube we already fix an unrolling depth and unrolling semantics, thus creating favorable conditions for HyperQube.<sup>9</sup> While BMC for trace properties reduces to SAT solving, BMC of hyperproperties reduces to QBF solving, a problem that is much harder and has seen less support by industry-strength tools. It is, therefore, unclear whether the advance of modern QBF solvers can improve the performance of hyperproperty BMC to the same degree that the advance of SAT solvers has stimulated the success of BMC for trace properties. Our findings indicate that, at the moment, QBF solving is often inferior to an explicit (automata-based) solving strategy.
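Footnote 9 contrasts this single fixed-depth query with a classical BMC loop. The following sketch is purely illustrative (it is not HyperQube's interface); `check_satisfaction` and `check_violation` stand for hypothetical QBF queries that succeed once the property can be shown to hold or to be violated at depth k:

```python
# Illustrative BMC loop: alternate satisfaction and violation queries at
# increasing unrolling depths, counting the roughly 2k queries issued before
# one of them succeeds at the least sufficient bound k.

def bmc_loop(check_satisfaction, check_violation, max_depth):
    queries = 0
    for k in range(1, max_depth + 1):
        queries += 1
        if check_satisfaction(k):       # hypothetical QBF query at depth k
            return ("holds", k, queries)
        queries += 1
        if check_violation(k):          # hypothetical QBF query at depth k
            return ("violated", k, queries)
    return ("unknown", max_depth, queries)

# Example: a property whose violation becomes provable at depth 4; the loop
# issues two queries per depth before terminating.
result = bmc_loop(lambda k: False, lambda k: k >= 4, max_depth=10)
```

With a fixed depth and semantics known in advance (as in Tables 2 and 3), a single query replaces this whole loop.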

## 8 Evaluating Strategy-based Verification

So far, we have used AutoHyper to check hyperproperties on instances arising in the literature. In this last section, we demonstrate that AutoHyper also serves as a valuable baseline to evaluate different (possibly incomplete) verification methods. Here we focus on strategy-based verification (SBV), i.e., the idea of automatically synthesizing a strategy that resolves existential quantification in ∀<sup>∗</sup>∃<sup>∗</sup> HyperLTL properties [19,7].

#### 8.1 Effectiveness of Strategy-based Verification

SBV is known to be incomplete [19,7]. However, due to the previous lack of complete tools for verifying ∀<sup>∗</sup>∃<sup>∗</sup> properties, a detailed study of how effective SBV is in practice was impossible on a larger scale (i.e., beyond hand-crafted examples). With AutoHyper, we can, for the first time, rigorously evaluate SBV. We use the SBV implementation from [7], which synthesizes a strategy for the ∃-player by translating the formula to a deterministic parity automaton (DPA) [35] and phrases the synthesis as a parity game.

We have generated random transition systems and properties of varying sizes and computed a ground truth using AutoHyper. We then performed SBV (recall that SBV can never show that a property does not hold and might fail to establish that it does). We find that for our generated instances, the property holds in 61.1% of the cases, and SBV can verify the property in 60.4% of the cases. Successful verification with SBV is thus possible in many cases, even without the addition of expensive mechanisms such as prophecies [7]. On the other hand, our results show that random generation produces instances (albeit not many)

<sup>9</sup> In Tables 2 and 3, we perform a single query with a fixed unrolling depth k and semantics, i.e., we already know whether we want to show satisfaction or violation and the depth needed to show this (as done in [31]). In a classical BMC loop, we would check for satisfaction and violation with an incrementally increasing unrolling depth and thus perform roughly 2k many QBF queries, where k is the least bound for which satisfaction or violation can be established (if this bound even exists).

on which SBV fails (so far, examples where SBV fails required careful construction by hand). Reverting to SBV as the default verification strategy is thus not possible, further strengthening the case for complete model checking tools (of which AutoHyper is the first).

#### 8.2 Efficiency of Strategy-based Verification

After having analyzed the effectiveness of SBV (i.e., how many instances can be verified), we turn our attention to the efficiency of SBV. In theory, (automata-based) model checking of ∀<sup>∗</sup>∃<sup>∗</sup> HyperLTL – as implemented in AutoHyper – is EXPSPACE-complete in the size of the specification and PSPACE-complete in the size of the system [15,36]. Conversely, SBV is 2-EXPTIME-complete in the size of the specification but only PTIME in the size of the system [19]. Consequently, one would expect that ABV fares better on larger specifications and SBV fares better on larger systems (the more important measure in practice).

However, in this section, we show that this does not translate into practice (at least using the current implementation of SBV [7]). We compare the running time of AutoHyper (ABV) (using spot's inclusion checker) and SBV. We break the running time into the three main steps of each method. For ABV, these are the LTL-to-NBA translation, the construction of the product automaton, and the inclusion check. For SBV, they are the LTL-to-DPA translation, the construction of the game, and the game-solving.

Fig. 3: We compare ABV (AutoHyper) and SBV ([7]) on instances of varying system size. We fix the property size to 20. We generate 100 random instances for each size and take the average over the fastest L instances, where L is the minimum number of instances solved within a 5s timeout by both methods.

We depict the average cost for varying system sizes in Figure 3. We observe that SBV performs worse than ABV and, more importantly, scales poorly in the size of the system. This is contrary to the theoretical analysis of ABV and SBV. As the detailed breakdown of the running time suggests, the poor performance is due to the costly construction of the game and the time taken to solve the game. An almost identical picture emerges if we compare ABV and SBV relative to the property size (we give a plot in the full version [9]). While, in this case, the results match the theory (i.e., SBV scales worse in the size of the specification), we find that the bottleneck for SBV is not the LTL-to-DPA translation (which, in theory, is exponentially more expensive than the LTL-to-NBA translation used in ABV) but, again, the construction and solving of the parity game.

We remark that the SBV engine we used [7] is not optimized and always constructs the full (reachable) game graph. The poor performance of SBV can be attributed to the fact that the size of the game does, in the worst case, scale quadratically in the size of the system (when considering ∀<sup>1</sup>∃<sup>1</sup> properties). This is amplified in dense systems (i.e., systems with many transitions), as, with increasing transition density, the size of the parity game approaches its worst-case size (see the full version [9]). In contrast, the heavily optimized inclusion checker (in this case spot) seems to be able to check inclusion in almost constant time (despite being exponential in theory). This efficiency of mature language inclusion checkers is what enables AutoHyper to achieve remarkable performance that often exceeds that of symbolic methods such as BMC (cf. Section 7) and further strengthens the practical impact of Proposition 1.

## 9 Conclusion

In this paper, we have presented AutoHyper, the first complete model checker for HyperLTL with an arbitrary quantifier prefix. We have demonstrated that AutoHyper can check many interesting properties involving quantifier alternations and often outperforms symbolic methods such as BMC, sometimes by orders of magnitude. We believe the biggest advantage of AutoHyper to be its push-button functionality combined with its completeness: as a user, one does not need to worry whether AutoHyper is applicable to a particular property (different from, e.g., SBV or BMC) and does not need to provide hints (e.g., in the form of explicit strategies in SBV).

Apart from evaluating AutoHyper's performance on a range of benchmarks, we have used AutoHyper to (1) compare various backend language inclusion checkers, (2) explore practical verification beyond one quantifier alternation (which is not as infeasible as the theory suggests), and (3) rigorously evaluate the effectiveness and efficiency of strategy-based verification in practice (which, contrary to what a theoretical analysis suggests, performs worse than automata-based methods).

Acknowledgments. This work was partially supported by the DFG in project 389792660 (Center for Perspicuous Systems, TRR 248) and by the ERC Grant HYPER (No. 101055412). R. Beutner carried out this work as a member of the Saarbrücken Graduate School of Computer Science.

## Data Availability Statement

AutoHyper and all experiments are available at [4].

## References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Machine Learning/Neural Networks**

## Feature Necessity & Relevancy in ML Classifier Explanations

Xuanxiang Huang<sup>1</sup>, Martin C. Cooper<sup>2</sup>, Antonio Morgado<sup>3</sup>, Jordi Planes<sup>4</sup>, and Joao Marques-Silva<sup>5</sup>

<sup>1</sup> University of Toulouse, Toulouse, France xuanxiang.huang@univ-toulouse.fr <sup>2</sup> Univ. Paul Sabatier, IRIT, Toulouse, France martin.cooper@irit.fr

<sup>3</sup> Universitat de Lleida, Lleida, Spain antonio.morgado@udl.cat

<sup>4</sup> Universitat de Lleida, Lleida, Spain jordi.planes@udl.cat

5 IRIT, CNRS, Toulouse, France joao.marques-silva@irit.fr

Abstract. Given a machine learning (ML) model and a prediction, explanations can be defined as sets of features which are sufficient for the prediction. In some applications, and besides asking for an explanation, it is also critical to understand whether sensitive features can occur in some explanation, or whether a non-interesting feature must occur in all explanations. This paper starts by relating such queries respectively with the problems of relevancy and necessity in logic-based abduction. The paper then proves membership and hardness results for several families of ML classifiers. Afterwards the paper proposes concrete algorithms for two classes of classifiers. The experimental results confirm the scalability of the proposed algorithms.

Keywords: Formal Explainability · Abduction · Abstraction Refinement.

## 1 Introduction

The remarkable achievements in machine learning (ML) in recent years [12,32,47] are not matched by a comparable degree of trust. The most promising ML models are inscrutable in their operation. As a direct consequence, the opacity of ML models raises distrust in their use and deployment. Motivated by a critical need for helping human decision makers to grasp the decisions made by ML models, there has been extensive work on explainable AI (XAI). Well-known examples include so-called model agnostic explainers or alternatives based on saliency maps for neural networks [9,50,58,59]. While most XAI approaches do not offer guarantees of rigor, and so can produce explanations that are unsound given the underlying ML model, there have been efforts on developing rigorous XAI approaches over the last few years [40, 54, 63]. Rigorous explainability involves the computation of explanations, but also the ability to answer a wide range of related queries [7, 8, 36].

By building on the relationship between explainability and logic-based abduction [25, 30, 40, 61], this paper analyzes two concrete queries, namely feature necessity and relevancy. Given an ML classifier, an instance (i.e. point in feature space and associated prediction) and a target feature, the goal of feature necessity is to decide whether the target feature occurs in all explanations of the given instance. Under the same assumptions, the goal of feature relevancy is to decide whether a feature occurs in some explanation of the given instance. This paper proves a number of complexity results regarding feature necessity and relevancy, focusing on well-known families of classifiers, some of which are widely used in ML. Moreover, the paper proposes novel algorithms for deciding relevancy for two families of classifiers. The experimental results demonstrate the scalability of the proposed algorithms.

The paper is organized as follows. The notation and definitions used throughout are presented in Section 2. The problems of feature necessity and relevancy are studied in Section 3, and example algorithms are proposed in Section 4. Section 5 presents experimental results for a sample of families of classifiers, Section 6 relates our contribution with earlier work and Section 7 concludes the paper.

## 2 Preliminaries

Complexity classes, propositional logic & quantification. The paper assumes basic knowledge of computational complexity, namely the classes of decision problems P, NP and Σ<sup>p</sup><sub>2</sub> [6]. The paper also assumes basic knowledge of propositional logic, including the Boolean satisfiability (SAT) problem for propositional logic formulas in conjunctive normal form (CNF), and the use of SAT solvers as oracles for the complexity class NP. The interested reader is referred to textbooks on these topics [6, 13].

#### 2.1 Classification Problems

Throughout the paper, we will consider classifiers as the underlying ML model. Classification problems are defined on a set of features (or attributes) F = {1, . . . , m} and a set of classes K = {c<sub>1</sub>, c<sub>2</sub>, . . . , c<sub>K</sub>}. Each feature i ∈ F takes values from a domain D<sub>i</sub>. Domains are categorical or ordinal, and each domain can be defined on boolean, integer/discrete or real values. Feature space is defined as F = D<sub>1</sub> × D<sub>2</sub> × . . . × D<sub>m</sub>. The notation x = (x<sub>1</sub>, . . . , x<sub>m</sub>) denotes an arbitrary point in feature space, where each x<sub>i</sub> is a variable taking values from D<sub>i</sub>. The set of variables associated with the features is X = {x<sub>1</sub>, . . . , x<sub>m</sub>}. Also, the notation v = (v<sub>1</sub>, . . . , v<sub>m</sub>) represents a specific point in feature space, where each v<sub>i</sub> is a constant representing one concrete value from D<sub>i</sub>. A classifier C is characterized by a (non-constant) classification function κ that maps feature space F into the set of classes K, i.e. κ : F → K. An instance denotes a pair (v, c), where v ∈ F and c ∈ K, with c = κ(v).
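These definitions can be made concrete in a few lines of code. The classifier below is a made-up two-class boolean example (not one from the paper), used only to instantiate the notation:

```python
from itertools import product

# Toy formalization of Section 2.1 (illustrative only): features F = {1,...,m}
# with boolean domains D_i = {0,1}, classes K = {0,1}, a (non-constant)
# classification function kappa, and an instance (v, c) with c = kappa(v).

m = 3
FEATURES = range(1, m + 1)
DOMAINS = {i: (0, 1) for i in FEATURES}

def kappa(x):
    # hypothetical classifier: predicts 1 iff x_1 and x_2 agree
    return 1 if x[0] == x[1] else 0

# Feature space F = D_1 x D_2 x ... x D_m
feature_space = list(product(*(DOMAINS[i] for i in FEATURES)))

v = (1, 1, 0)                 # a specific point v in feature space
instance = (v, kappa(v))      # the instance (v, c)
```

Points in feature space are tuples; a feature index i ∈ F addresses component x[i − 1].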

#### 2.2 Examples of Classifiers

The results presented in the paper apply to a comprehensive range of widely used classifiers [28, 62]. These include, decision trees (DTs) [18, 42], decision graphs (DGs) [44] and diagrams (DDs) [1, 68], decision lists (DLs) [38, 60] and sets (DSs) [19,41], tree ensembles (TEs) [37], including random forests (RFs) [17,43] and boosted trees (BTs) [29], neural networks (NNs) [56], naive bayes classifiers (NBCs) [45, 52], classifiers represented with propositional languages, including deterministic decomposable negation normal form (d-DNNFs) [23, 35] and its proper subsets, e.g. sentential decision diagrams (SDDs) [22,66] and free binary decision diagrams (FBDDs) [23,31,68], and also monotonic classifiers. In the rest of the paper, we will analyze some families of classifiers in more detail.

d-DNNF classifiers. Negation normal form (NNF) is a well-known propositional language, where negation operators are restricted to atoms, or inputs. Any propositional formula can be reduced to NNF in polynomial time. Let the support of a node be the set of atoms associated with leaves reachable from the outgoing edges of the node. Decomposable NNF (DNNF) is a restriction of NNF where the children of AND nodes do not share atoms in their support. A DNNF circuit is deterministic (referred to as d-DNNF) if no two children of an OR node can both take value 1 for any assignment to the inputs. Restrictions of NNF, including DNNF and d-DNNF, exhibit important tractability properties [23]. Besides, we briefly introduce FBDDs, which are a proper subset of d-DNNFs. An FBDD over a set X of Boolean variables is a rooted, directed acyclic graph comprising two types of nodes: nonterminal and terminal. A nonterminal node is labeled by a variable x<sub>i</sub> ∈ X, and has two outgoing edges, one labeled by 0 and the other by 1. A terminal node is labeled by 1 or 0, and has no outgoing edges. A subgraph rooted at a node labeled with a variable x<sub>i</sub> represents a boolean function f defined by the Shannon expansion: f = (x<sub>i</sub> ∧ f|<sub>x<sub>i</sub>=1</sub>) ∨ (¬x<sub>i</sub> ∧ f|<sub>x<sub>i</sub>=0</sub>), where f|<sub>x<sub>i</sub>=1</sub> (f|<sub>x<sub>i</sub>=0</sub>) denotes the cofactor [16] of f with respect to x<sub>i</sub> = 1 (x<sub>i</sub> = 0). Moreover, any FBDD is read-once, meaning that each variable is tested at most once on any path from the root node to a terminal node.
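To make the Shannon-expansion view concrete, here is a small sketch (our own toy encoding, not from the paper) that evaluates an FBDD and checks the expansion by brute force:

```python
from itertools import product

# Toy FBDD encoding (illustrative only): a nonterminal node is a triple
# (var, low, high) -- follow `low` when the variable is 0 and `high` when it
# is 1; a terminal node is the integer 0 or 1.

def eval_fbdd(node, x):
    while not isinstance(node, int):   # descend one root-to-terminal path
        var, low, high = node
        node = high if x[var] else low
    return node

# Read-once FBDD for f = x1 AND (x2 OR x3): every variable is tested at most
# once on any path from the root to a terminal.
f = ("x1", 0, ("x2", ("x3", 0, 1), 1))

# Brute-force check of the Shannon expansion at x1:
# f = (x1 AND f|x1=1) OR (NOT x1 AND f|x1=0)
for bits in product((0, 1), repeat=3):
    x = dict(zip(("x1", "x2", "x3"), bits))
    hi = eval_fbdd(f, {**x, "x1": 1})   # cofactor f|x1=1
    lo = eval_fbdd(f, {**x, "x1": 0})   # cofactor f|x1=0
    expanded = (x["x1"] and hi) or ((not x["x1"]) and lo)
    assert eval_fbdd(f, x) == expanded
```

Evaluation follows a single path, which is well defined precisely because the diagram is read-once.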

Monotonic classifiers. Monotonic classifiers find a number of important applications, and have been studied extensively in recent years [26, 48, 65, 70]. Let ≼ denote a partial order on the set of classes K; for example, we assume c<sub>1</sub> ≼ c<sub>2</sub> ≼ . . . ≼ c<sub>K</sub>. Furthermore, we assume that each domain D<sub>i</sub> is ordered such that the value taken by feature i is between a lower bound λ(i) and an upper bound µ(i). Given v<sub>1</sub> = (v<sub>11</sub>, . . . , v<sub>1i</sub>, . . . , v<sub>1m</sub>) and v<sub>2</sub> = (v<sub>21</sub>, . . . , v<sub>2i</sub>, . . . , v<sub>2m</sub>), we say that v<sub>1</sub> ≤ v<sub>2</sub> if ∀(i ∈ F).(v<sub>1i</sub> ≤ v<sub>2i</sub>). Finally, a classifier is monotonic if whenever v<sub>1</sub> ≤ v<sub>2</sub>, then κ(v<sub>1</sub>) ≼ κ(v<sub>2</sub>).

Running examples. As hinted above, throughout the paper, we will consider two fairly different families of classifiers, namely classifiers represented with d-DNNFs and monotonic classifiers.

Example 1. The first example is the d-DNNF classifier C<sub>1</sub> shown in Fig. 1. It represents the boolean function (x<sub>1</sub> ∧ (x<sub>2</sub> ∨ x<sub>4</sub>)) ∨ (¬x<sub>1</sub> ∧ x<sub>3</sub> ∧ x<sub>4</sub>). The instance considered throughout the paper is (v<sub>1</sub>, c<sub>1</sub>) = ((0, 1, 0, 0), 0).

Fig. 1: Example of a d-DNNF classifier. (a) Graphical representation of the d-DNNF, i.e. κ<sub>1</sub> (not reproduced here). (b) Definition of F<sub>1</sub> = {1, 2, 3, 4}, D<sub>1i</sub> = {0, 1} for i = 1, . . . , 4, and K<sub>1</sub> = {0, 1}. (c) Alternative representation of κ<sub>1</sub>:

IF x<sub>1</sub> = 1 ∧ x<sub>2</sub> = 1 THEN 1 ELSE IF x<sub>1</sub> = 1 ∧ x<sub>4</sub> = 1 THEN 1 ELSE IF x<sub>3</sub> = 1 ∧ x<sub>4</sub> = 1 THEN 1 ELSE 0

$$\begin{aligned} \mathcal{F}\_2 &= \{1, 2, 3, 4\} \\ \mathbb{D}\_{2i} &= \{0, 1\}, i = 1, \dots, 4 \\ \mathcal{K}\_2 &= \{0, 1\} \end{aligned} \qquad \qquad \qquad \kappa\_2(\mathbf{x}) = \begin{cases} 1 & \text{if } x\_1 + x\_2 + x\_3 \ge 2 \\ 0 & \text{otherwise} \end{cases}$$
 
$$\text{(a) Definition of } \mathcal{F}\_2, \mathbb{D}\_{2i}, \mathcal{K}\_2 \qquad \qquad \text{(b) Definition of } \kappa\_2$$

Fig. 2: Example of a monotonic classifier

Example 2. The second running example is the monotonic classifier C<sub>2</sub> shown in Fig. 2. The instance that is considered throughout the paper is (v<sub>2</sub>, c<sub>2</sub>) = ((1, 1, 1, 1), 1).
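The monotonicity condition can be checked by brute force on κ<sub>2</sub> (a sketch over the boolean domains of Fig. 2; since the classes are ordered 0 ≼ 1, the class order ≼ coincides with ≤ here):

```python
from itertools import product

# kappa_2 from Fig. 2: predicts 1 iff at least two of x1, x2, x3 are 1.
def kappa2(x):
    return 1 if x[0] + x[1] + x[2] >= 2 else 0

def is_monotonic(kappa, m):
    # check that v1 <= v2 (componentwise) implies kappa(v1) <= kappa(v2)
    points = list(product((0, 1), repeat=m))
    return all(kappa(a) <= kappa(b)
               for a in points for b in points
               if all(ai <= bi for ai, bi in zip(a, b)))

# A non-monotonic contrast (hypothetical): flipping a bit can flip the class.
def parity(x):
    return (x[0] + x[1]) % 2
```

Here `is_monotonic(kappa2, 4)` holds, whereas the parity classifier fails the condition.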

#### 2.3 Formal Explainability

Prime implicant (PI) explanations [63] represent a minimal set of literals (relating a feature value x<sub>i</sub> and a constant v<sub>i</sub> ∈ D<sub>i</sub>) that are logically sufficient for the prediction. PI-explanations are related to logic-based abduction, and so are also referred to as abductive explanations (AXp's) [54]. AXp's offer guarantees of rigor that are not offered by other alternative explanation approaches. More recently, AXp's have been studied in terms of their computational complexity [7, 10]. There is a growing body of recent work on formal explanations [3–5, 14, 15, 24, 27, 33, 51, 54, 67].

Formally, given v = (v1, . . . , vm) ∈ F, with κ(v) = c, an AXp is any subsetminimal set X ⊆ F such that,

$$\mathsf{WAXp}(\mathcal{X}) \quad := \quad \forall (\mathbf{x} \in \mathbb{F}) . \left[ \bigwedge\_{i \in \mathcal{X}} (x\_i = v\_i) \right] \to (\kappa(\mathbf{x}) = c) \tag{1}$$

If a set X ⊆ F is not minimal but (1) holds, then X is referred to as a weak AXp. Clearly, the predicate WAXp maps 2<sup>F</sup> into {⊥, ⊤} (or {false, true}). Given v ∈ F, an AXp X represents an irreducible (or minimal) subset of the features which, if assigned the values dictated by v, are sufficient for the prediction c, i.e. value changes to the features not in X will not change the prediction. We can use the definition of the predicate WAXp to formalize the definition of the predicate AXp, also defined on subsets X of F:

$$\mathsf{AXp}(\mathcal{X}) \quad := \quad \mathsf{WAXp}(\mathcal{X}) \land \forall (\mathcal{X}' \subsetneq \mathcal{X}) . \neg \mathsf{WAXp}(\mathcal{X}') \tag{2}$$

The definition of WAXp(X) ensures that the predicate is monotone. Indeed, if X ⊆ X′ ⊆ F, and if X is a weak AXp, then X′ is also a weak AXp, as fixing more features will not change the prediction. Given the monotonicity of the predicate WAXp, the definition of the predicate AXp can be simplified as follows, with X ⊆ F:

$$\mathsf{AXp}(\mathcal{X}) \quad := \quad \mathsf{WAXp}(\mathcal{X}) \land \forall (j \in \mathcal{X}) . \neg \mathsf{WAXp}(\mathcal{X} \setminus \{j\}) \tag{3}$$

This simpler but equivalent definition of AXp has important practical significance, in that only a linear number of subsets needs to be checked, as opposed to the exponentially many subsets in (2). As a result, the algorithms that compute one AXp are based on (3) [54].
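To illustrate why (3) matters algorithmically, the following brute-force sketch (toy boolean domains; the exhaustive WAXp check stands in for the oracle calls used by real algorithms) extracts one AXp with a linear number of WAXp checks, using κ<sub>2</sub> from Fig. 2 and the instance v = (1, 1, 1, 1):

```python
from itertools import product

# Deletion-based AXp extraction justified by (3): start from the weak AXp
# X = F and drop each feature whose removal keeps WAXp true.

def kappa2(x):
    return 1 if x[0] + x[1] + x[2] >= 2 else 0

def waxp(kappa, m, v, c, X):
    # (1): fixing the features in X to their values in v forces prediction c
    return all(kappa(x) == c
               for x in product((0, 1), repeat=m)
               if all(x[i - 1] == v[i - 1] for i in X))

def one_axp(kappa, m, v):
    c, X = kappa(v), set(range(1, m + 1))
    for j in range(1, m + 1):     # linear number of WAXp checks, cf. (3)
        if waxp(kappa, m, v, c, X - {j}):
            X.remove(j)
    return X
```

For κ<sub>2</sub> and v = (1, 1, 1, 1), features 1 and 4 are dropped and the AXp {2, 3} of Example 4 is returned; a different traversal order can yield {1, 2} or {1, 3}.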

Example 3. From Example 1, and given the instance ((0, 1, 0, 0), 0), we can conclude that the prediction will be 0 if features 1 and 3 take value 0, or if features 1 and 4 take value 0. Hence, the AXp's are {1, 3} and {1, 4}. It is also apparent that the assignment x<sub>2</sub> = 1 has no bearing on the fact that the prediction is 0.

Example 4. From Example 2, we can conclude that any two of the variables x<sub>1</sub>, x<sub>2</sub>, x<sub>3</sub> being assigned value 1 suffice for the prediction. Hence, given the instance ((1, 1, 1, 1), 1), the possible AXp's are {1, 2}, {1, 3}, and {2, 3}. Observe that the definition of κ<sub>2</sub> does not depend on feature 4.

Besides abductive explanations, another commonly studied type of explanations are contrastive or counterfactual explanations [8, 36, 39, 55]. As argued in related work [36], the duality between abductive and contrastive explanations implies that for the purpose of the queries studied in this paper, it suffices to study solely abductive explanations.

## 3 Feature Relevancy & Necessity: Theory

This section investigates the complexity of feature relevancy and necessity<sup>6</sup> . We are interested in membership results, which allow us to devise algorithms for the target problems. We are also interested in hardness results, which serve to confirm that the running time complexities of the proposed algorithms are within reason, given the problem's complexity.

#### 3.1 Defining Necessity, Relevancy & Irrelevancy

Throughout this section, a classifier C is assumed, with features F, domains D<sub>i</sub>, i ∈ F, classes K, a classification function κ : F → K, and a concrete instance (v, c), v ∈ F, c ∈ K.

<sup>6</sup> For the sake of brevity, we opt to only present sketches of some of the proofs.

Definition 1 (Feature Necessity, Relevancy & Irrelevancy). Let A denote the set of all AXp's for a classifier given a concrete instance, i.e. A = {X ⊆ F | AXp(X)}, and let t ∈ F be a target feature. Then, (i) t is necessary if t ∈ ∩<sub>X∈A</sub>X; (ii) t is relevant if t ∈ ∪<sub>X∈A</sub>X; and (iii) t is irrelevant if t ∈ F \ ∪<sub>X∈A</sub>X.

Throughout the remainder of the paper, the problem of deciding feature necessity is represented by the acronym FNP, and the problem of deciding feature relevancy is represented by the acronym FRP.

Example 5. As shown earlier, for the d-DNNF classifier of Fig. 1, and given the instance (v<sub>1</sub>, c<sub>1</sub>) = ((0, 1, 0, 0), 0), there exist two AXp's, i.e. {1, 3} and {1, 4}. Clearly, feature 1 is necessary, and features 1, 3 and 4 are relevant. In contrast, feature 2 is irrelevant.

Example 6. For the monotonic classifier of Fig. 2, and given the instance (v<sub>2</sub>, c<sub>2</sub>) = ((1, 1, 1, 1), 1), we have argued earlier that there exist three AXp's, i.e. {1, 2}, {1, 3} and {2, 3}, which allows us to conclude that features 1, 2 and 3 are relevant, but that feature 4 is irrelevant. In this case, there are no necessary features.
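A brute-force rendering of Definition 1 (exponential, so only viable for toy classifiers; this is not one of the paper's algorithms) reproduces Example 6:

```python
from itertools import product, combinations

# Enumerate all AXp's by checking every subset, then decide necessity and
# relevancy. kappa2 and v = (1, 1, 1, 1) are the running example of Fig. 2.

def kappa2(x):
    return 1 if x[0] + x[1] + x[2] >= 2 else 0

def waxp(kappa, m, v, c, X):
    return all(kappa(x) == c
               for x in product((0, 1), repeat=m)
               if all(x[i - 1] == v[i - 1] for i in X))

def all_axps(kappa, m, v):
    c = kappa(v)
    weak = [set(s) for r in range(m + 1)
            for s in combinations(range(1, m + 1), r)
            if waxp(kappa, m, v, c, set(s))]
    # AXp's are the subset-minimal weak AXp's
    return [X for X in weak if not any(Y < X for Y in weak)]

axps = all_axps(kappa2, 4, (1, 1, 1, 1))
necessary = set.intersection(*axps)    # (i): features in every AXp
relevant = set.union(*axps)            # (ii): features in some AXp
irrelevant = set(range(1, 5)) - relevant
```

This yields A = {{1, 2}, {1, 3}, {2, 3}}: no necessary features, relevant features {1, 2, 3}, and irrelevant feature 4, matching Example 6.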

The general complexity of necessity and (ir)relevancy has been studied in the context of logic-based abduction [25, 30, 61]. Recent uses in explainability are briefly overviewed in Section 6.

## 3.2 Feature Necessity

Proposition 2. If deciding WAXp(X ) is in complexity class C, then FNP is in the complexity class co-C.

Given the known polynomial complexity of deciding whether a set is a weak AXp for several families of classifiers [54], we then have the following result:

Corollary 3. For DTs, XpG's<sup>7</sup>, NBCs, d-DNNF classifiers and monotonic classifiers, FNP is in P.

## 3.3 Feature Relevancy: Membership Results

Proposition 4 (Feature Relevancy for DTs [36]). FRP for DTs is in P.

Proposition 5. If deciding WAXp(X) is in P, then FRP is in NP.

The argument above can also be used for proving the following results.

Corollary 6. For XpG's, NBCs, d-DNNF classifiers and monotonic classifiers, FRP is in NP.

Proposition 7. If deciding WAXp(X) is in NP, then FRP is in Σ<sup>p</sup><sub>2</sub>.

Corollary 8. For DLs, DSs, RFs, BTs, and NNs, FRP is in Σ<sup>p</sup><sub>2</sub>.

Additional results. The following result will prove useful in designing algorithms for FRP in practice.

Proposition 9. Let X ⊆ F, and let t ∈ X denote some target feature such that, WAXp(X ) holds and WAXp(X \ {t}) does not hold. Then, for any AXp Z ⊆ X ⊆ F, it must be the case that t ∈ Z.
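Proposition 9 suggests how relevancy can be decided: feature t is relevant exactly when some set X satisfies WAXp(X) ∧ ¬WAXp(X \ {t}), since any AXp inside such an X must contain t. The sketch below (our own brute-force search over candidate sets; the practical algorithms of Section 4 delegate this search to oracle calls) decides FRP for κ<sub>2</sub> of Fig. 2:

```python
from itertools import product, combinations

# Brute-force FRP check based on Proposition 9 (illustrative only).

def kappa2(x):
    return 1 if x[0] + x[1] + x[2] >= 2 else 0

def waxp(kappa, m, v, c, X):
    return all(kappa(x) == c
               for x in product((0, 1), repeat=m)
               if all(x[i - 1] == v[i - 1] for i in X))

def is_relevant(kappa, m, v, t):
    c = kappa(v)
    # search for X with t in X, WAXp(X), and not WAXp(X \ {t})
    return any(waxp(kappa, m, v, c, set(s)) and
               not waxp(kappa, m, v, c, set(s) - {t})
               for r in range(1, m + 1)
               for s in combinations(range(1, m + 1), r)
               if t in s)
```

For v = (1, 1, 1, 1), features 1, 2, 3 are reported relevant and feature 4 is not, consistent with Example 6.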

<sup>7</sup> Explanation graphs (XpG's) have been proposed to enable the computation of explanations for decision graphs, and (multi-valued) decision diagrams [36].

#### 3.4 Feature Relevancy: Hardness Results

Proposition 10 (Relevancy for DNF Classifiers [36]). Feature relevancy for a DNF classifier is Σ<sup>p</sup><sub>2</sub>-hard.

Proposition 11. Feature relevancy for monotonic classifiers is NP-hard.

Proof. We say that a CNF is trivially satisfiable if some literal occurs in all clauses. Clearly, SAT restricted to nontrivial CNFs is still NP-complete. Let Φ be a not trivially satisfiable CNF on variables x<sub>1</sub>, . . . , x<sub>k</sub>. Let N = 2k. Let Φ̃ be identical to Φ except that each occurrence of a negative literal ¬x<sub>i</sub> (1 ≤ i ≤ k) is replaced by x<sub>i+k</sub>. Thus Φ̃ is a CNF on N variables, each of which occurs only positively. Define the boolean classifier κ (on N + 1 features) by κ(x<sub>0</sub>, x<sub>1</sub>, . . . , x<sub>N</sub>) = 1 iff x<sub>i</sub> = x<sub>i+k</sub> = 1 for some i ∈ {1, . . . , k} or x<sub>0</sub> ∧ Φ̃(x<sub>1</sub>, . . . , x<sub>N</sub>) = 1. To show that κ is monotonic we need to show that a ≤ b ⇒ κ(a) ≤ κ(b). This follows by examining the two cases in which κ(a) = 1: if a<sub>i</sub> = a<sub>i+k</sub> = 1 and a ≤ b, then b<sub>i</sub> = b<sub>i+k</sub> = 1, whereas if a<sub>0</sub> ∧ Φ̃(a<sub>1</sub>, . . . , a<sub>N</sub>) = 1 and a ≤ b, then b<sub>0</sub> ∧ Φ̃(b<sub>1</sub>, . . . , b<sub>N</sub>) = 1 (by positivity of Φ̃), so in both cases κ(b) = 1 ≥ κ(a).

Clearly κ(1<sup>N+1</sup>) = 1. There are k obvious AXp's of this prediction, namely {i, i + k} (1 ≤ i ≤ k). These are minimal by the assumption that Φ is not trivially satisfiable. This means that no other AXp contains both i and i + k for any i ∈ {1, . . . , k}. Suppose that Φ(u) = 1. Let X<sub>u</sub> be {0} ∪ {i | 1 ≤ i ≤ k ∧ u<sub>i</sub> = 1} ∪ {i + k | 1 ≤ i ≤ k ∧ u<sub>i</sub> = 0}. Then X<sub>u</sub> is a weak AXp of the prediction κ(1<sup>N+1</sup>) = 1. Furthermore, X<sub>u</sub> does not contain any of the AXp's {i, i + k}. Therefore some subset of X<sub>u</sub> is an AXp, and clearly this subset must contain feature 0. Thus if Φ is satisfiable, then there is an AXp which contains 0.

We now show that the converse also holds. If X is an AXp of κ(1<sup>N+1</sup>) = 1 containing 0, then it cannot also contain any of the pairs i, i + k (1 ≤ i ≤ k), otherwise we could delete 0 and still have a weak AXp. We will show that this implies that we can build a satisfying assignment u for Φ. Consider first v = (v<sub>0</sub>, . . . , v<sub>N</sub>) defined by v<sub>i</sub> = 1 if i ∈ X (0 ≤ i ≤ N), v<sub>i+k</sub> = 1 if neither i nor i + k belongs to X (1 ≤ i ≤ k), and v<sub>i</sub> = 0 otherwise (1 ≤ i ≤ N). Then κ(v) = 1 by the definition of an AXp, since v agrees with the vector 1 on all features in X. We can also note that v<sub>0</sub> = 1 since 0 ∈ X. Since X does not contain both i and i + k (1 ≤ i ≤ k), it follows that v<sub>i</sub> ≠ v<sub>i+k</sub>. Now let u<sub>i</sub> = 1 iff i ∈ X ∧ 1 ≤ i ≤ k. It is easy to verify that Φ(u) = Φ̃(v) = κ(v) = 1.

Thus, determining whether the prediction κ(1<sup>N+1</sup>) = 1 has an AXp containing feature 0 is equivalent to testing the satisfiability of Φ. It follows that FRP is NP-hard for monotonic classifiers by this polynomial reduction from SAT.
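The reduction can be instantiated on a tiny CNF. The code below is our own toy rendering of the proof's construction; the brute-force AXp enumeration is only there to make the claim checkable on this small case:

```python
from itertools import product, combinations

# Phi = x1 & (~x1 | x2): satisfiable, and no literal occurs in all clauses.
# Literals are (variable index, polarity); k variables, N = 2k.
k = 2
PHI = [[(1, True)], [(1, False), (2, True)]]

def phi_tilde(y):
    # Phi~: each negative literal ~x_i is replaced by the fresh variable
    # x_{i+k}; y = (x0, x1, ..., xN) with N = 2k.
    return all(any(y[i] if pos else y[i + k] for (i, pos) in cl) for cl in PHI)

def kappa(y):
    # kappa(y) = 1 iff x_i = x_{i+k} = 1 for some i, or x0 & Phi~(x1..xN) = 1
    if any(y[i] == 1 == y[i + k] for i in range(1, k + 1)):
        return 1
    return 1 if y[0] == 1 and phi_tilde(y) else 0

n = 2 * k + 1                               # features 0, 1, ..., N

def waxp(X):                                # fix the features in X to 1
    return all(kappa(y) == 1
               for y in product((0, 1), repeat=n)
               if all(y[i] == 1 for i in X))

weak = [set(s) for r in range(n + 1)
        for s in combinations(range(n), r) if waxp(set(s))]
axps = [X for X in weak if not any(Y < X for Y in weak)]
```

Here the AXp's of κ(1<sup>N+1</sup>) = 1 are the pairs {1, 3} and {2, 4} plus {0, 1, 2}, so feature 0 is relevant, matching the satisfiability of Φ.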

Proposition 12. Relevancy for FBDD classifiers is NP-hard.

Proof. Let ψ be a CNF formula defined on a variable set X = {x1, . . . , xm} and with clauses {ω1, . . . , ωn}. We aim to construct, in polynomial time, an FBDD classifier G (representing a classification function κ) based on ψ and a target variable, such that ψ is satisfiable iff κ has an AXp containing this target variable.

For any literal $\ell_j \in \omega_i$, replace $\ell_j$ with $\ell_j^i$. Let $\psi' = \{\omega'_1, \dots, \omega'_n\}$ denote the resulting CNF formula defined on the new variables $\{x_1^1, \dots, x_m^1, \dots, x_1^n, \dots, x_m^n\}$. For each original variable $x_j$, let $I_j^+$ and $I_j^-$ denote the indices of clauses containing literal $x_j$ and $\neg x_j$, respectively. So if $i \in I_j^+$, then $x_j^i \in \omega'_i$; if $i \in I_j^-$, then $\neg x_j^i \in \omega'_i$. To build an FBDD D from ψ′: 1) build an FBDD $D_i$ for each $\omega'_i$; 2) replace the terminal node 1 of $D_i$ with the root node of $D_{i+1}$. D is read-once because each variable $x_j^i$ occurs only once in ψ′. Satisfying a literal $x_j^i \in \omega'_i$ means $x_j = 1$, while satisfying a literal $\neg x_j^k \in \omega'_k$ means $x_j = 0$. If both $x_j^i$ and $\neg x_j^k$ are satisfied, then we have picked inconsistent values for the variable $x_j$, which is unacceptable. Let us define φ to capture inconsistent values for any variable $x_j$:

$$\phi := \bigvee\_{1 \le j \le m} \left( \left( \bigvee\_{i \in I\_j^+} x\_j^i \right) \wedge \left( \bigvee\_{k \in I\_j^-} \neg x\_j^k \right) \right) \tag{4}$$

If $I_j^+ = \emptyset$, then let $\bigvee_{i \in I_j^+} x_j^i = 0$. If $I_j^- = \emptyset$, then let $\bigvee_{k \in I_j^-} \neg x_j^k = 0$. Any true point of φ means we pick inconsistent values for some variable $x_j$, so it represents an unacceptable point of ψ. To avoid such inconsistency, one needs to falsify at least one of $\bigvee_{i \in I_j^+} x_j^i$ and $\bigvee_{k \in I_j^-} \neg x_j^k$ for each variable $x_j$. To build an FBDD G from φ: 1) build FBDDs $G_j^+$ and $G_j^-$ for $\bigvee_{i \in I_j^+} x_j^i$ and $\bigvee_{k \in I_j^-} \neg x_j^k$, respectively; 2) replace the terminal node 1 of $G_j^+$ with the root node of $G_j^-$, and let $G_j$ denote the resulting FBDD; 3) replace the terminal 0 of $G_j$ with the root node of $G_{j+1}$. G is read-once because each variable $x_j^i$ occurs only once in φ.

Create a root node labeled $x_0^0$, and link its 1-edge to the root of D and its 0-edge to the root of G. The resulting graph G is an FBDD representing $\kappa := (x_0^0 \wedge \psi') \vee (\neg x_0^0 \wedge \phi)$; κ is a boolean classifier defined on $\{x_0^0, x_1^1, \dots, x_m^n\}$ and $x_0^0$ is the target variable. The number of nodes of G is $O(n \times m)$. Let $I = \{(0, 0), (1, 1), \dots, (n, m)\}$ denote the set of variable indices; for variable $x_j^i$, $(i, j) \in I$.
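The renaming of occurrences and the index sets $I_j^+$/$I_j^-$ used by φ can be sketched as follows; the data representation is illustrative, and the FBDD construction itself is omitted:

```python
def rename_and_phi(psi, m):
    """Rename literal occurrences apart (psi') and build the index sets for phi.

    psi: list of clauses over variables 1..m; each clause is a list of signed
    ints. The occurrence of x_j in clause i becomes the fresh variable (i, j);
    I_pos[j] / I_neg[j] are the clause indices where x_j occurs
    positively / negatively (the proof's I_j^+ / I_j^-).
    """
    psi_prime = [[(i, abs(l), l > 0) for l in cl] for i, cl in enumerate(psi)]
    I_pos = {j: [i for i, cl in enumerate(psi) if j in cl] for j in range(1, m + 1)}
    I_neg = {j: [i for i, cl in enumerate(psi) if -j in cl] for j in range(1, m + 1)}
    return psi_prime, I_pos, I_neg

def phi_holds(u, I_pos, I_neg):
    """phi(u) = 1 iff for some j both a positive and a negative occurrence of
    x_j are 'satisfied', i.e. inconsistent values are picked for x_j.
    u maps (clause_index, variable) pairs to 0/1."""
    return any(
        any(u.get((i, j)) == 1 for i in I_pos[j]) and
        any(u.get((k, j)) == 0 for k in I_neg[j])
        for j in I_pos)

psi = [[1, -2], [2, 3]]          # (x1 or ~x2) and (x2 or x3)
_, I_pos, I_neg = rename_and_phi(psi, 3)
# satisfying x2's positive occurrence (clause 1) together with its negative
# occurrence (clause 0) is exactly the inconsistency phi detects:
assert phi_holds({(1, 2): 1, (0, 2): 0}, I_pos, I_neg)
assert not phi_holds({(1, 2): 1, (0, 2): 1}, I_pos, I_neg)
```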

Pick an instance $v = (v_0^0, \dots, v_j^i, \dots)$ satisfying every literal of ψ′ (i.e. $v_j^i = 1$ and $v_j^k = 0$ for $x_j^i, \neg x_j^k \in \psi'$) and such that $v_0^0 = 1$; then $\psi'(v) = 1$, and so $\kappa(v) = 1$. Suppose $\mathcal{X} \subseteq I$ is an AXp of v: 1) If $\{(i, j), (k, j)\} \subseteq \mathcal{X}$ for some variable $x_j$, where $i \in I_j^+$ and $k \in I_j^-$, then for any point u of κ such that $u_j^i = v_j^i$ for any $(i, j) \in \mathcal{X}$, we have $\kappa(u) = 1$ and $\phi(u) = 1$. Moreover, if u sets $u_0^0 = 1$, then $\kappa(u) = 1$ implies $\psi'(u) = 1$; else if u sets $u_0^0 = 0$, then $\kappa(u) = 1$ because $\phi(u) = 1$. Since $\kappa(u) = 1$ regardless of the value of $u_0^0$, $(0, 0) \notin \mathcal{X}$. 2) If $\{(i, j), (k, j)\} \not\subseteq \mathcal{X}$ for any variable $x_j$, where $i \in I_j^+$ and $k \in I_j^-$, then for some point u of κ such that $u_j^i = v_j^i$ for any $(i, j) \in \mathcal{X}$, we have $\phi(u) \neq 1$; in this case $\kappa(u) = 1$ implies $\psi'(u) = 1$, and any such u must set $u_0^0 = 1$, so $(0, 0) \in \mathcal{X}$. If case 2) occurs, then ψ is satisfiable (a satisfying assignment is $x_j = 1$ iff $\exists i \in I_j^+$ s.t. $(i, j) \in \mathcal{X}$). If case 2) never occurs, then ψ is unsatisfiable. It follows that FRP is NP-hard for FBDD classifiers by this polynomial reduction from SAT.

Corollary 13. Relevancy for d-DNNF classifiers is NP-hard.

## 4 Feature Relevancy: Example Algorithms

This section details two methods for FRP. One method decides feature relevancy for d-DNNF classifiers, whereas the other decides feature relevancy for arbitrary monotonic classifiers. Based on Proposition 2 and Corollary 3, existing algorithms for computing one AXp [35, 36, 52, 53] can be used to decide feature necessity; hence, there is no need to devise new algorithms. Additionally, the weak AXp returned by the proposed methods (if it exists) can be fed (as a seed) into the algorithms for computing one AXp [35, 53] to extract one AXp in polynomial time.

Table 1: Encoding for deciding whether there is a weak AXp including feature t.

#### 4.1 Relevancy for d-DNNF Classifiers

This section details a propositional encoding that decides feature relevancy for d-DNNFs. The encoding follows the approach described in the proof of Proposition 9, and comprises two copies ($C^0$ and $C^t$) of the same d-DNNF classifier C: $C^0$ encodes WAXp($\mathcal{X}$) (i.e. the prediction of κ remains unchanged), and $C^t$ encodes ¬WAXp($\mathcal{X} \setminus \{t\}$) (i.e. the prediction of κ changes). The encoding is polynomial in the size of the classifier's representation.

The encoding is applicable to the case $\kappa(x) = 0$; the case $\kappa(x) = 1$ can be transformed to $\neg\kappa(x) = 0$, so we assume both the d-DNNF C and its negation ¬C are given. To present the constraints included in this encoding, we need to introduce some auxiliary boolean variables and predicates.


The encoding is summarized in Table 1. As literals are d-DNNF leaves, the values of the selector variables only affect the values of the indicator variables of leaf nodes. Constraint (1.1) states that for any leaf node j whose literal is consistent with the given instance, its indicator $n_j^k$ is always consistent regardless of the value of $s_i$. On the contrary, constraint (1.3) states that for any leaf node j whose literal is inconsistent with the given instance, its indicator $n_j^k$ is consistent iff feature i is not picked; in other words, feature i can take any value. Because replica k ($k > 0$) is used to check the necessity of including feature k in $\mathcal{X}$, we assume the value of the local copy of selector $s_k$ is 0 in replica k. In this case, as defined in constraint (1.2), even though leaf node j labeled with feature k has a literal that is inconsistent with the given instance, its indicator $n_j^k$ is consistent. Constraint (1.4) defines the indicator for an arbitrary ∨ node j, and constraint (1.5) defines the indicator for an arbitrary ∧ node j. Together, these constraints declare how consistency is propagated through the entire d-DNNF. Constraint (1.6) states that the prediction of the d-DNNF classifier C remains 0, since the selected features form a weak AXp. Constraint (1.7) states that if feature i is selected, then removing it will change the prediction of C. Finally, constraint (1.8) indicates that feature t must be included in $\mathcal{X}$.
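The propagation of consistency described by constraints (1.1)–(1.5) can be sketched recursively; the node representation below is hypothetical, and the sketch evaluates indicators directly instead of emitting CNF clauses:

```python
def consistent(node, s, v, freed=None):
    """Indicator value n_j for a d-DNNF node under picked-feature set s.

    node: ('lit', i, pol) | ('and', [children]) | ('or', [children]).
    v: dict mapping feature -> 0/1 (the given instance).
    freed: feature whose local selector is forced to 0 (constraint (1.2)),
           or None for replica 0.
    A leaf consistent with v is always 1 (constraint (1.1)); an inconsistent
    leaf is 1 iff its feature is not picked (1.3) or is freed (1.2).
    Consistency propagates by disjunction / conjunction ((1.4), (1.5)).
    """
    kind = node[0]
    if kind == 'lit':
        _, i, pol = node
        return 1 if v[i] == pol or i not in s or i == freed else 0
    vals = [consistent(child, s, v, freed) for child in node[1]]
    return max(vals) if kind == 'or' else min(vals)

# a tiny (made-up) d-DNNF and instance
toy = ('or', [('and', [('lit', 1, 1), ('lit', 2, 0)]), ('lit', 3, 1)])
v = {1: 0, 2: 0, 3: 0}
assert consistent(toy, set(), v) == 1      # all leaves free
assert consistent(toy, {1, 2, 3}, v) == 0  # all leaves pinned to v
```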

Example 7. Given the d-DNNF classifier of Fig. 1 and the instance $(v_1, c_1) = ((0, 1, 0, 0), 0)$, suppose that the target feature is 3. We have selectors $s = \{s_1, s_2, s_3, s_4\}$, and the encoding is as follows:

$$\begin{array}{ll}
1. & (n\_1^0 \leftrightarrow n\_2^0 \lor n\_3^0) \land (n\_2^0 \leftrightarrow n\_4^0 \land n\_5^0) \land (n\_3^0 \leftrightarrow n\_6^0 \land n\_7^0) \land (n\_5^0 \leftrightarrow n\_8^0 \lor n\_9^0) \land \\
& (n\_7^0 \leftrightarrow n\_{10}^0 \land n\_{11}^0) \land (n\_9^0 \leftrightarrow n\_{12}^0 \land n\_{13}^0) \land (n\_4^0 \leftrightarrow \neg s\_1) \land (n\_6^0 \leftrightarrow 1) \land (n\_8^0 \leftrightarrow 1) \land \\
& (n\_{10}^0 \leftrightarrow \neg s\_3) \land (n\_{11}^0 \leftrightarrow \neg s\_4) \land (n\_{12}^0 \leftrightarrow \neg s\_2) \land (n\_{13}^0 \leftrightarrow \neg s\_4) \land (\neg n\_1^0) \land (s\_3) \\
2. & (n\_1^3 \leftrightarrow n\_2^3 \lor n\_3^3) \land (n\_2^3 \leftrightarrow n\_4^3 \land n\_5^3) \land (n\_3^3 \leftrightarrow n\_6^3 \land n\_7^3) \land (n\_5^3 \leftrightarrow n\_8^3 \lor n\_9^3) \land \\
& (n\_7^3 \leftrightarrow n\_{10}^3 \land n\_{11}^3) \land (n\_9^3 \leftrightarrow n\_{12}^3 \land n\_{13}^3) \land (n\_4^3 \leftrightarrow \neg s\_1) \land (n\_6^3 \leftrightarrow 1) \land (n\_8^3 \leftrightarrow 1) \land \\
& (n\_{10}^3 \leftrightarrow 1) \land (n\_{11}^3 \leftrightarrow \neg s\_4) \land (n\_{12}^3 \leftrightarrow \neg s\_2) \land (n\_{13}^3 \leftrightarrow \neg s\_4) \land (s\_3 \leftrightarrow n\_1^3)
\end{array}$$

Given the AXp's listed in Example 3, solving these formulas yields either {1, 3} or {1, 4} as the AXp.

#### 4.2 Relevancy for Monotonic Classifiers

This section describes an algorithm for FRP in the case of monotonic classifiers. No assumption is made regarding the actual implementation of the monotonic classifier.

Abstraction refinement for relevancy. The algorithm proposed in this section iteratively refines an over-approximation (or abstraction) of all the subsets S of F such that: i) S is a weak AXp, and ii) any AXp included in S also includes the target feature t. Formally, the set of subsets of F that we are interested in is defined as follows:

$$\mathbb{H} = \{ \mathcal{S} \subseteq \mathcal{F} \mid \mathsf{WAXp}(\mathcal{S}) \land \forall (\mathcal{X} \subseteq \mathcal{S}) . \left[ \mathsf{AXp}(\mathcal{X}) \to (t \in \mathcal{X}) \right] \}\tag{5}$$

The proposed algorithm iteratively refines the over-approximation of set ℍ until one can decide with certainty whether t is included in some AXp. The refinement step involves exploiting counterexamples as these are identified. (The approach is referred to as abstraction refinement FRP, since the use of abstraction refinement can be related to earlier work (with the same name) in model checking [20].) In practice, it will in general be impractical to manipulate such an over-approximation of set ℍ explicitly. As a result, we use a propositional formula (in fact a CNF formula) H, such that the models of H encode the subsets of features about which we have yet to decide whether each of those subsets only contains AXp's that include t. (Formula H is defined on a set of Boolean variables $\{s_1, \dots, s_m\}$, where each $s_i$ is associated with feature i, and assigning $s_i = 1$ denotes that feature i is included in a given set, as described below.) The algorithm then iteratively refines the over-approximation by filtering out sets that have been shown not to be included in ℍ, i.e. the so-called counterexamples.
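Picking a candidate set from the models of H can be sketched as follows; exhaustive enumeration stands in for the SAT oracle, and the clause conventions follow the positive/negative clauses described below:

```python
from itertools import product

def pick_candidate(H, m, t):
    """Guess a subset P containing feature t whose selector assignment
    satisfies H.  Stands in for the SAT oracle: H is a CNF over selector
    variables 1..m (+i means feature i picked, -i means not picked).
    Returns a set of features, or None if no model remains.
    """
    for bits in product((0, 1), repeat=m):
        if bits[t - 1] != 1:
            continue
        if all(any((bits[abs(l) - 1] == 1) == (l > 0) for l in cl) for cl in H):
            return {i + 1 for i in range(m) if bits[i]}
    return None

H = []                      # initially unconstrained
H.append([1, 2])            # positive clause: pick 1 or 2 in the future
H.append([-3, -4])          # negative clause: never re-pick {3,4} together
P = pick_candidate(H, m=4, t=4)
assert P is not None and 4 in P and not {3, 4} <= P and P & {1, 2}
```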

Algorithm 1 summarizes the proposed approach<sup>8</sup>. Also, Algorithms 2 and 3 provide supporting functions. (For simplicity, the function calls of Algorithms 2 and 3 show the arguments, but not the parameterizations.) Algorithm 1 iteratively uses an NP oracle (in fact a SAT solver) to pick (or guess) a subset P of F, such that no previously picked set is repeated. Since we are interested in feature t, we enforce that the picked set must include t. (This step is shown in lines 4 to 7.) Now, the features not in P are deemed universal, and so we need to account for the range of possible values that these universal features can take. For that, we update lower and upper bounds on the predicted classes; for the features in P we use the values dictated by v. (This is shown in lines 8 and 9, and it is sound to do because of the monotonicity of prediction.) If the lower and upper bounds differ, then the picked set is not even a weak AXp, and so we can safely remove it from further consideration. This is achieved by enforcing that at least one of the non-picked elements is picked in the future. (As can be observed, H is updated with a positive clause that captures this constraint, as shown in line 11.) If the lower and upper bounds do not differ (i.e. we picked a weak AXp), and if allowing t to take any value causes the bounds to differ, then we know that any AXp in P must include t, and so the algorithm reports P as a weak AXp that is guaranteed to be included in ℍ. (This is shown in line 14.) It should be noted that P is not necessarily an AXp. However, by Proposition 9, P is guaranteed to be a weak AXp such that any of the AXp's contained in P must include feature t. From [53], we know that we can extract an AXp from a weak AXp in polynomial time, and in this case we are guaranteed to always pick one that includes t. Finally, the last case is when allowing t to take any value does not cause the lower and upper bounds to change. This means we picked a set P that is a weak AXp, but not all AXp's in P include the target feature t (again due to Proposition 9). As a result, we must prevent the same weak AXp from being re-picked. This is achieved by requiring that at least one of the picked features not be picked in future iterations. (This is shown in line 16. As can be observed, H is updated with a negative clause that captures this constraint.)
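The loop just described can be sketched end-to-end for boolean monotonic classifiers; exhaustive enumeration stands in for the SAT oracle, the AXp-extraction step is omitted, and all names are illustrative:

```python
from itertools import product

def relevancy_monotonic(kappa, m, v, t):
    """Decide relevancy of feature t (1-based) for a monotonic classifier
    kappa over m boolean features and instance v (tuple of 0/1).
    Returns a witness set P (see Proposition 14), or None."""
    def pick(H):
        # stand-in SAT oracle: a model of CNF H over selectors 1..m with s_t = 1
        for bits in product((0, 1), repeat=m):
            if bits[t - 1] == 1 and all(
                    any((bits[abs(l) - 1] == 1) == (l > 0) for l in cl)
                    for cl in H):
                return {i + 1 for i in range(m) if bits[i]}
        return None

    def bounds(fixed):
        # lower/upper predictions: free features go to 0 / 1 (monotonicity)
        lo = tuple(v[i - 1] if i in fixed else 0 for i in range(1, m + 1))
        hi = tuple(v[i - 1] if i in fixed else 1 for i in range(1, m + 1))
        return kappa(lo), kappa(hi)

    H = []
    while True:
        P = pick(H)
        if P is None:
            return None                              # t is not relevant
        lo, hi = bounds(P)
        if lo != hi:                                 # P is not a weak AXp
            H.append([i for i in range(1, m + 1) if i not in P])
            continue
        lo2, hi2 = bounds(P - {t})                   # also free the target
        if lo2 != hi2:
            return P         # weak AXp whose every contained AXp includes t
        H.append([-i for i in P])                    # block this weak AXp

# toy monotonic classifier: majority of 3 features
assert relevancy_monotonic(lambda x: int(sum(x) >= 2), 3, (1, 1, 1), 1) is not None
```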

As can be concluded from Algorithm 1 and from the discussion above, Proposition 9 is essential to enable us to use at most two classification queries per iteration of the algorithm. If we were to use Proposition 5 instead, then the number of classification queries would be significantly larger.

<sup>8</sup> Arguments can either represent actual arguments or some parameterization; these are separated by a semi-colon.

#### Algorithm 1 Deciding feature relevancy for a monotonic classifier

Table 2: Example algorithm execution for t = 4

Example 8. We consider the monotonic classifier of Fig. 2, with instance (v, c) = ((1, 1, 1, 1), 1). Table 2 summarizes a possible execution of the algorithm when t = 4. Similarly, Table 3 summarizes a possible execution of the algorithm when t = 1. (As in the current implementation, and for both examples, the creation of clauses uses no optimizations.) In general, different executions will be determined by the models returned by the SAT solver.

With respect to the clauses that are added to H at each step, as shown in Algorithms 2 and 3, one can envision optimizations (shown in lines 2 to 7 of both algorithms) that heuristically aim at removing features from the given sets, and so produce shorter (and thus logically stronger) clauses. The insight is that any feature which can be deemed irrelevant for the condition used to construct the clause can be safely removed from the set. (In practice, our experiments show that the time spent running the classifier is far larger than the time spent using the NP oracle to guess sets. Thus, we opted for the simplest approach to constructing the clauses, so as to reduce the number of classification queries.)

Table 3: Example algorithm execution for t = 1

Given the above discussion, we can conclude that the proposed algorithm is sound, complete and terminating for deciding feature relevancy for monotonic classifiers. (The proof is straightforward, and it is omitted for the sake of brevity.)

Proposition 14. For a monotonic classifier C, defined on a set of features F, with κ mapping $\mathbb{F}$ to K, and an instance (v, c), $v \in \mathbb{F}$, $c \in K$, and a target feature $t \in F$, Algorithm 1 returns a set $P \subseteq F$ iff P is a weak AXp for (v, c) with the property that any AXp $\mathcal{X} \subseteq P$ is such that $t \in \mathcal{X}$ (i.e. P is a witness for the relevancy of t).

## 5 Experimental Results

This section reports experimental results on FRP for d-DNNF and monotonic classifiers. The goal is to show that FRP is practically feasible. We opt not to include experiments for FNP, as FNP is in P; besides, to the best of our knowledge, there is no baseline to compare with. The experiments were performed on a MacBook Pro with a 6-core Intel Core i7 2.6 GHz processor and 16 GByte RAM, running macOS Monterey.

d-DNNF classifiers. For d-DNNFs, we pick SDDs, a subset of d-DNNFs, as our target classifiers. SDDs support polynomial-time negation, so given an SDD C, one can obtain its negation ¬C efficiently.


Table 4: Solving FRP for SDDs. Sub-Columns Avg. #var and Avg. #cls show, respectively, the average number of variables and clauses in a CNF encoding. Column Runtime reports maximum and average time in seconds for deciding FRP.

Monotonic classifiers. For monotonic classifiers, we consider the Deep Lattice Network (DLN) [70] as our target classifier. Since our approach for monotonic classifiers is model-agnostic, it could also be used with other approaches for learning monotonic classifiers [48, 69], including Min-Max Networks [21, 64] and COMET [65].

Prototype implementation. Prototype implementations of the proposed approaches were developed in Python<sup>9</sup>. The PySAT toolkit<sup>10</sup> was used for the propositional encodings. Besides, PySAT invokes the Glucose 4<sup>11</sup> SAT solver to pick a weak AXp candidate. SDDs were loaded using the PySDD<sup>12</sup> package.

Benchmarks & training. For SDDs, we selected 11 datasets from the Density Estimation Benchmark Datasets<sup>13</sup> [34, 46, 49]. These datasets were used to learn SDDs using LearnSDD [11] (with parameter maxEdges=20000). The obtained SDDs were used as binary classifiers. For DLNs, we selected 5 publicly available datasets: australian (aus), breast\_cancer (b.c.), heart\_c, nursery [57] and pima [2]. We used a three-layer DLN architecture: Calibrators → Random Ensemble of Lattices → Linear Layer. All calibrators for all models used a fixed number of 20 keypoints, and the size of all lattices was set to 3.

Results for SDDs. For each SDD, 100 test instances were randomly generated. All tested instances have prediction 0. (We did not pick instances predicted to class 1, as this would require the compilation of a new classifier, which may have a different size.) Besides, for each instance, we randomly picked a feature appearing in the model. Hence, for each SDD, we solved 100 queries. Table 4 summarizes the results. It can be observed that the number of nodes of the tested SDDs is in the range of 3704 to 9472, and the number of features is in the range of 183 to 513. Besides, the percentage of examples for which the answer is Y (i.e. the target feature is in some AXp) ranges from 85% to 100%. Regarding the runtime, the largest running time for solving one query can exceed 15 minutes, but the average running time to solve a query is less than 25 seconds; this highlights the scalability of the proposed encoding.

<sup>9</sup> https://github.com/XuanxiangHuang/frp-experiment

<sup>10</sup> https://github.com/pysathq/pysat

<sup>11</sup> https://www.labri.fr/perso/lsimon/glucose/

<sup>12</sup> https://github.com/wannesm/PySDD

<sup>13</sup> https://github.com/UCLA-StarAI/Density-Estimation-Datasets

Table 5: Solving FRP for DLN. Column Runtime reports maximum and average time in seconds for deciding FRP. Column SAT Time (resp. κ(v) Time) reports maximum and average time in seconds for the SAT solver (resp. for calling DLN's predict function) to decide FRP. Column SAT Calls (resp. κ(v) Calls) reports maximum and average number of calls to the SAT solver (resp. to the DLN's predict function) to decide FRP.

Results for DLNs. For each DLN, we randomly picked 200 test instances, and for each test instance, we randomly picked a feature. Hence, for each DLN, we solved 200 queries. Table 5 summarizes the results. The use of a SAT solver has a negligible contribution to the running time; indeed, for all the examples shown, at least 97% of the running time is spent running the classifier. This should be unsurprising, since the number of iterations of Algorithm 1 never exceeds a few hundred. (The fraction of a second reported in some cases should be divided by the number of calls to the SAT solver; hence the time spent in each call to the SAT solver is indeed negligible.) As can be observed, the percentage of examples for which the answer is Y (i.e. the target feature is in some AXp and the algorithm returns true) ranges from 35% to 74%. There is no apparent correlation between the percentage of Y answers and the number of iterations. The large number of queries accounts for the number of times the DLN is queried by Algorithm 1, but it also accounts for the number of times the DLN is queried when extracting an AXp from set P (i.e. the witness) when the algorithm's answer is true. A loose upper bound on the number of queries to the classifier is 4 × NS + 2 × |F|, where NS is the number of SAT calls and |F| is the number of features: each iteration of Algorithm 1 can require at most 4 queries to the classifier, and after reporting P, at most 2 queries per feature are required to extract the AXp (see Section 2.3). As can be observed, this loose upper bound is respected by the reported results.
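The loose bound can be stated as a one-line helper (names are illustrative):

```python
def query_upper_bound(num_sat_calls, num_features):
    """Loose upper bound on classifier queries: each iteration of
    Algorithm 1 makes at most 4 queries (two bound pairs), and AXp
    extraction from the witness needs at most 2 queries per feature."""
    return 4 * num_sat_calls + 2 * num_features

assert query_upper_bound(100, 20) == 440
```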

## 6 Related Work

The problems of necessity and relevancy have been studied in logic-based abduction since the early 90s [25, 30, 61]. However, this earlier work did not consider the classes of (classifier) functions that are considered in this paper.

There has been recent work on explainability queries [7, 8, 36]. Some of these queries can be related to feature relevancy and necessity. For example, relevancy and necessity have been studied with respect to a target class [7, 8], in contrast with our approach, which studies a concrete instance and so can be naturally related to earlier work on abduction. Recent work [36] studied feature relevancy under the name feature membership, but neither d-DNNF nor monotonic classifiers were discussed. Moreover, [36] only proved the hardness of deciding feature relevancy for DNF and DT classifiers and did not discuss the feature necessity problem. The results presented in this paper complement this work. Besides, the complexity results on FRP and FNP in this paper also complement recent work [54] summarizing the progress of formal explanations. [40] focused on the computation of one arbitrary AXp and one smallest AXp, which is orthogonal to our work: computing one AXp does not decide either FRP or FNP, since the target feature t may not appear in the computed AXp. [53] studied the computation of one formal explanation and the enumeration of formal explanations in a case study of monotonic classifiers; however, neither FRP nor FNP were identified and studied.

## 7 Conclusions

This paper studies the problems of feature necessity and relevancy in the context of formal explanations of ML classifiers. The paper proves several complexity results, some related to necessity, but most related to relevancy. Furthermore, the paper proposes two different approaches for solving relevancy for two families of classifiers, namely classifiers represented in the d-DNNF propositional language, and monotonic classifiers. The experimental results confirm the practical scalability of the proposed algorithms. Future work will seek to prove hardness results for the families of classifiers for which hardness is yet unknown.

Acknowledgements. This work was supported by the AI Interdisciplinary Institute ANITI, funded by the French program "Investing for the Future – PIA3" under Grant agreement no. ANR-19-PI3A-0004; by the H2020-ICT38 project COALA "Cognitive Assisted agile manufacturing for a Labor force supported by trustworthy Artificial intelligence"; by the Spanish Ministry of Science and Innovation (MICIN) under project PID2019-111544GB-C22; and by a María Zambrano fellowship and a Requalification fellowship financed by the Ministerio de Universidades of Spain and by the European Union – NextGenerationEU.

## References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## Towards Formal XAI: Formally Approximate Minimal Explanations of Neural Networks

Shahaf Bassan and Guy Katz(B)

The Hebrew University of Jerusalem, Jerusalem, Israel {shahaf.bassan,g.katz}@mail.huji.ac.il

Abstract. With the rapid growth of machine learning, deep neural networks (DNNs) are now being used in numerous domains. Unfortunately, DNNs are "black-boxes", and cannot be interpreted by humans, which is a substantial concern in safety-critical systems. To mitigate this issue, researchers have begun working on explainable AI (XAI) methods, which can identify a subset of input features that are the cause of a DNN's decision for a given input. Most existing techniques are heuristic, and cannot guarantee the correctness of the explanation provided. In contrast, recent and exciting attempts have shown that formal methods can be used to generate provably correct explanations. Although these methods are sound, the computational complexity of the underlying verification problem limits their scalability; and the explanations they produce might sometimes be overly complex. Here, we propose a novel approach to tackle these limitations. We (i) suggest an efficient, verification-based method for finding minimal explanations, which constitute a provable approximation of the global, minimum explanation; (ii) show how DNN verification can assist in calculating lower and upper bounds on the optimal explanation; (iii) propose heuristics that significantly improve the scalability of the verification process; and (iv) suggest the use of bundles, which allows us to arrive at more succinct and interpretable explanations. Our evaluation shows that our approach significantly outperforms state-of-the-art techniques, and produces explanations that are more useful to humans. We thus regard this work as a step toward leveraging verification technology in producing DNNs that are more reliable and comprehensible.

## 1 Introduction

Machine learning (ML) is a rapidly growing field with a wide range of applications, including safety-critical, high-risk systems in the fields of health care [18], aviation [38] and autonomous driving [12]. Despite their success, ML models, and especially deep neural networks (DNNs), remain "black-boxes" — they are incomprehensible to humans and are prone to unexpected behaviour and errors. This issue can result in major catastrophes [13, 73], and also in poor decision-making due to brittleness or bias [7, 24].

In order to render DNNs more comprehensible to humans, researchers have been working on explainable AI (XAI), where we seek to construct models for explaining and interpreting the decisions of DNNs [50,55–57]. Work to date has focused on heuristic approaches, which provide explanations, but do not provide guarantees about the correctness or succinctness of these explanations [14,32,44]. Although these approaches are an important step, their limitations might result in skewed results, possibly failing to meet the regulatory guidelines of institutions and organizations such as the European Union, the US government, and the OECD [51]. Thus, producing DNN explanations that are provably accurate remains of utmost importance.

More recently, the formal verification community has proposed approaches for providing formal and rigorous explanations for DNN decision making [27, 31, 51, 59]. Many of these approaches rely on the recent and rapid developments in DNN verification [1, 8, 9, 39]. These approaches typically produce an abductive explanation (also known as a prime implicant, or PI-explanation) [31, 58, 59]: a minimum subset of input features, which by themselves already determine the classification produced by the DNN, regardless of any other input features. These explanations afford formal guarantees, and can be computed via DNN verification [31].

Abductive explanations are highly useful, but there are two major difficulties in computing them. First, there is the issue of scalability: computing locally minimal explanations might require a polynomial number of costly invocations of the underlying DNN verifier, and computing a globally minimal explanation is even more challenging [10, 31, 48]. The second difficulty is that users may sometimes prefer "high-level" explanations, not based solely on input features, as these may be easier to grasp and interpret compared to "low-level", complex, feature-based explanations.

To tackle the first difficulty, we propose here new approaches for more efficiently producing verification-based abductive explanations. More concretely, we propose a method for provably approximating minimum explanations, allowing stakeholders to use slightly larger explanations that can be discovered much more quickly. To accomplish this, we leverage the recently discovered dual relationship between explanations and contrastive examples [30]; and also take advantage of the sensitivity of DNNs to small adversarial perturbations [64], to compute both lower and upper bounds for the minimum explanation. In addition, we propose novel heuristics for significantly expediting the underlying verification process.

In addressing the second difficulty, i.e. the interpretability limitations of "low-level" explanations, we propose to construct explanations in terms of bundles, which are sets of related features. We empirically show that using our method to produce bundle explanations can significantly improve the interpretability of the results, and even the scalability of the approach, while still maintaining the soundness of the resulting explanations.

To summarize, our contributions include the following: (i) We are the first to suggest a method that formally produces sound and minimal abductive explanations that provably approximate the global-minimum explanation. (ii) Our three suggested novel heuristics expedite the search for minimal abductive explanations, significantly outperforming the state of the art. (iii) We suggest a novel approach for using bundles to efficiently produce sound and provable explanations that are more interpretable and succinct.

For evaluation purposes, we implemented our approach as a proof-of-concept tool. Although our method can be applied to any ML model, we focused here on DNNs, where the verification process is known to be NP-complete [39], and the scalable generation of explanations is known to be challenging [31, 58]. We used our tool to test the approach on DNNs trained for digit and clothing classification, and also compared it to state-of-the-art approaches [31,32]. Our results indicate that our approach was successful in quickly producing meaningful explanations, often running 40% faster than existing tools. We believe that these promising results showcase the potential of this line of work.

The rest of the paper is organized as follows. Sec. 2 contains background on DNNs and their verification, as well as on formal, minimal explanations. Sec. 3 covers the main method for calculating approximations of minimum explanations, and Sec. 4 covers methods for improving the efficiency of calculating these approximations. Sec. 5 covers the use of bundles in constructing "high-level", provable explanations. Next, we present our evaluation in Sec. 6. Related work is covered in Sec. 7, and we conclude in Sec. 8.

## 2 Background

DNNs. A deep neural network (DNN) [46] is a directed graph composed of layers of nodes, commonly called neurons. In feed-forward NNs the data flows from the first (input) layer, through intermediate (hidden) layers, and onto an output layer. A DNN's output is calculated by assigning values to its input neurons, and then iteratively calculating the values of neurons in subsequent layers. In the case of classification, which is the focus of this paper, each output neuron corresponds to a specific class, and the output neuron with the highest value corresponds to the class the input is classified to.

Fig. 1 depicts a simple, feed-forward DNN. The input layer includes three neurons, followed by a weighted-sum layer, which calculates an affine transformation of values from the input layer. Given the input V<sup>1</sup> = [1, 1, 1]<sup>T</sup>, the second layer computes the values V<sup>2</sup> = [6, 9, 11]<sup>T</sup>. Next comes a ReLU layer, which computes the function ReLU(x) = max(0, x) for each neuron in the preceding layer, resulting in

Fig. 1: A simple DNN.

V<sup>3</sup> = [6, 9, 11]<sup>T</sup> . The final (output) layer then computes an affine transformation, resulting in V<sup>4</sup> = [15,−4]<sup>T</sup> . This indicates that input V<sup>1</sup> = [1, 1, 1]<sup>T</sup> is classified as the category corresponding to the first output neuron, which is assigned the greater value.
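The forward pass described above can be sketched in a few lines of Python. The weight matrices below are hypothetical values, chosen only so that the computation reproduces the vectors stated in the text (the true weights are those of Fig. 1):

```python
def affine(W, x):
    # One affine (weighted-sum) layer; biases are omitted for simplicity.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(x):
    return [max(0, xi) for xi in x]

# Hypothetical weights, chosen only to reproduce the values in the text.
W2 = [[1, 2, 3], [2, 3, 4], [3, 4, 4]]   # input -> weighted-sum layer
W4 = [[1, 1, 0], [3, 0, -2]]             # ReLU layer -> output layer

def forward(v1):
    v2 = affine(W2, v1)    # [6, 9, 11]
    v3 = relu(v2)          # [6, 9, 11]
    return affine(W4, v3)  # [15, -4]

v4 = forward([1, 1, 1])
predicted = max(range(len(v4)), key=v4.__getitem__)  # winning class index
```

Since the first output neuron receives the greater value (15 > −4), the input is classified as class 0, matching the text.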

DNN Verification. A DNN verification query is a tuple ⟨P, N, Q⟩, where N is a DNN that maps an input vector x to an output vector y = N(x), P is a predicate on x, and Q is a predicate on y. A DNN verifier needs to decide whether there exists an input x<sub>0</sub> that satisfies P(x<sub>0</sub>) ∧ Q(N(x<sub>0</sub>)) (the SAT case) or not (the UNSAT case). Typically, P and Q are expressed in the logic of real arithmetic [49]. The DNN verification problem is known to be NP-complete [39].

Formal Explanations. We focus here on explanations for classification problems, where a model is trained to predict a label for each given input. A classification problem is a tuple ⟨F, D, K, N⟩ where (i) F = {1, ..., m} denotes the features; (ii) D = {D<sub>1</sub>, D<sub>2</sub>, ..., D<sub>m</sub>} denotes the domains of each of the features, i.e. the possible values that each feature can take. The entire feature (input) space is hence 𝔽 = D<sub>1</sub> × D<sub>2</sub> × ... × D<sub>m</sub>; (iii) K = {c<sub>1</sub>, c<sub>2</sub>, ..., c<sub>n</sub>} is a set of classes, i.e. the possible labels; and (iv) N : 𝔽 → K is a (non-constant) classification function (in our case, a neural network). A classification instance is a pair (v, c), where v ∈ 𝔽, c ∈ K, and c = N(v). In other words, v is mapped by the neural network N to class c.

Looking at (v, c), we often wish to know why v was classified as c. Informally, an explanation is a subset of features E ⊆ F, such that assigning these features to the values assigned to them in v already determines that the input will be classified as c, regardless of the remaining features F ∖ E. In other words, even if the values of the features that are not in the explanation are changed arbitrarily, the classification remains the same. More formally, given input v = (v<sub>1</sub>, ..., v<sub>m</sub>) ∈ 𝔽 with the classification N(v) = c, an explanation (sometimes referred to as an abductive explanation, or an AXP) is a subset of the features E ⊆ F, such that:

$$\forall x \in \mathbb{F}. \quad \left[\bigwedge_{i \in E} (x_i = v_i) \to (N(x) = c)\right] \tag{1}$$

We continue with the running example from Fig. 1. For simplicity, we assume that each input neuron can only be assigned the values 0 or 1. It can be observed that for input V<sup>1</sup> = [1, 1, 1]<sup>T</sup>, the set {v<sup>1</sup><sub>1</sub>, v<sup>1</sup><sub>2</sub>} is an explanation; indeed, once the first two entries in V<sup>1</sup> are set to 1, the classification remains the same for any value of the third entry (see Fig. 2). We can prove this by encoding a verification query ⟨P, N, Q⟩ = ⟨E = v, N, Q<sub>¬c</sub>⟩, where E is the candidate explanation, and E = v means that we restrict the features in E to their values in v; and Q<sub>¬c</sub> implies that the classification is not c. An UNSAT result for this query indicates that E is an explanation for instance (v, c).
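Since the running example restricts each input neuron to {0, 1}, the verification query of Eq. 1 can be discharged by exhaustive enumeration. The sketch below does exactly that, using a toy majority-vote classifier as a stand-in for the network (an assumption for illustration; features are 0-indexed):

```python
from itertools import product

def is_explanation(classify, m, v, E):
    """Brute-force the query of Eq. 1 over the binary feature space {0,1}^m:
    E is an explanation iff every input agreeing with v on E gets v's class."""
    c = classify(v)
    for x in product([0, 1], repeat=m):
        if all(x[i] == v[i] for i in E) and classify(x) != c:
            return False  # counterexample found: the query is SAT
    return True           # UNSAT: E is an explanation

# Toy stand-in for the network (an assumption): class 1 iff >= 2 features set.
classify = lambda x: int(sum(x) >= 2)
v = (1, 1, 1)
```

Here `is_explanation(classify, 3, v, {0, 1})` holds, mirroring the role of {v<sup>1</sup><sub>1</sub>, v<sup>1</sup><sub>2</sub>} in the running example, while `{0}` alone is not an explanation.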

Clearly, the set of all features constitutes a trivial explanation. However, we are interested in smaller explanation subsets, which can provide useful information regarding the decision of the classifier. More precisely, we search for minimal explanations and minimum explanations. A subset E ⊆ F is a minimal explanation (also referred to as a local-minimal explanation, or a subset-minimal explanation) of instance (v, c) if it is an explanation that ceases to be an explanation if even a single feature is removed from it:

$$\begin{aligned} &\left(\forall x \in \mathbb{F}.\ \left[\bigwedge_{i \in E} (x_i = v_i) \to (N(x) = c)\right]\right) \land\\ &\left(\forall j \in E.\ \exists y \in \mathbb{F}.\ \left[\bigwedge_{i \in E \setminus \{j\}} (y_i = v_i) \land (N(y) \neq c)\right]\right) \end{aligned} \tag{2}$$

Fig. 3 demonstrates that {v<sup>1</sup><sub>1</sub>, v<sup>1</sup><sub>2</sub>} is a minimal explanation in our running example: removing any of its features allows misclassification.

Fig. 2: {v<sup>1</sup><sub>1</sub>, v<sup>1</sup><sub>2</sub>} is an explanation for input V<sup>1</sup> = [1, 1, 1]<sup>T</sup>.

Fig. 3: {v<sup>1</sup><sub>1</sub>, v<sup>1</sup><sub>2</sub>} is a minimal explanation for input V<sup>1</sup> = [1, 1, 1]<sup>T</sup>.

A minimum explanation (sometimes referred to as a cardinal-minimal explanation or a PI-explanation) is defined as a minimal explanation of minimum size; i.e., if E is a minimum explanation, then there does not exist a minimal explanation E′ ≠ E such that ∣E′∣ < ∣E∣. Fig. 4 demonstrates that {v<sup>1</sup><sub>3</sub>} is a minimum explanation for our running example.

Fig. 4: {v<sup>1</sup><sub>3</sub>} is a minimum explanation for input V<sup>1</sup> = [1, 1, 1]<sup>T</sup>.

Contrastive Example. A subset of features C ⊆ F is called a contrastive example or a contrastive explanation (CXP) if altering the features in C is sufficient to cause the misclassification of a given classification instance (v, c):

$$\exists x \in \mathbb{F}. \left[\bigwedge_{i \in F \setminus C} (x_i = v_i) \land (N(x) \neq c)\right] \tag{3}$$

A contrastive example for our running example is shown in Fig. 5. Notice that the question of whether a set is a contrastive example can be encoded into a verification query ⟨P, N, Q⟩ = ⟨(F ∖ C) = v, N, Q<sub>¬c</sub>⟩, where a SAT result indicates that C is a contrastive example. As with explanations, smaller contrastive examples are more valuable than large ones. One useful notion is that of a contrastive singleton: a

Fig. 5: {v<sup>1</sup><sub>2</sub>, v<sup>1</sup><sub>3</sub>} is a contrastive example for V<sup>1</sup> = [1, 1, 1]<sup>T</sup>.

contrastive example of size one. A contrastive singleton could represent a specific pixel in an image, the alteration of which could result in misclassification. Such singletons are leveraged in "one-pixel attacks" [64] (see Fig. 16 in the appendix of the full version of this paper [11]). Contrastive singletons have the following important property:

### Lemma 1. Every contrastive singleton is contained in all explanations.

The proof appears in Sec. A of the appendix of the full version of this paper [11]. Lemma 1 implies that each contrastive singleton is contained in all minimal/minimum explanations.

We also consider the notion of a contrastive pair, which is a contrastive example of size 2. Clearly, for any pair of features (u, v) where u or v is a contrastive singleton, (u, v) is a contrastive pair; however, when we next refer to contrastive pairs, we consider only pairs that do not contain any contrastive singletons. Likewise, for every k > 2, we can consider contrastive examples of size k, excluding any that contain contrastive examples of sizes 1, ..., k − 1 as subsets.
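On small binary instances, Eq. 3 can likewise be checked by enumeration, which lets us list all contrastive singletons and all contrastive pairs, filtering out pairs that contain a singleton, as discussed above. The classifier below is a toy assumption for illustration:

```python
from itertools import product

def is_contrastive(classify, v, C):
    """Eq. 3: can altering only the features in C flip the classification?"""
    c = classify(v)
    for vals in product([0, 1], repeat=len(C)):
        x = list(v)
        for i, val in zip(C, vals):
            x[i] = val
        if classify(tuple(x)) != c:
            return True
    return False

# Toy classifier (an assumption): class 1 iff x0 AND (x1 OR x2).
classify = lambda x: int(x[0] and (x[1] or x[2]))
v = (1, 1, 1)

singletons = [{i} for i in range(3) if is_contrastive(classify, v, [i])]
pairs = [{i, j} for i in range(3) for j in range(i + 1, 3)
         if not any(s <= {i, j} for s in singletons)
         and is_contrastive(classify, v, [i, j])]
```

For this toy classifier, feature 0 is a contrastive singleton (flipping it alone changes the class), and {1, 2} is the only contrastive pair.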

We state the following lemma, whose proof also appears in Sec. A of the appendix of the full version of this paper [11]:

Lemma 2. All explanations contain at least one element of every contrastive pair.

The lemma can be generalized to any k > 2, and can be used in showing that the minimum hitting set (MHS) of all contrastive examples is exactly the minimum explanation [29, 54] (see Sec. B of the appendix of the full version of this paper [11]). Further, the lemma implies a duality between contrastive examples and explanations [30, 34]: a minimal hitting set of all contrastive examples constitutes a minimal explanation, and a minimal hitting set of all explanations constitutes a minimal contrastive example.
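The MHS connection can be observed concretely on a toy instance: brute-forcing all contrastive examples and computing their minimum hitting set recovers exactly the size of the minimum explanation. The majority-vote classifier here is an assumption for illustration:

```python
from itertools import product, combinations

# Toy classifier (an assumption): class 1 iff at least two features are set.
classify = lambda x: int(sum(x) >= 2)
v, c, m = (1, 1, 1), 1, 3

def flips(C):
    """Can some assignment to the features in C (others fixed to v) flip the class?"""
    for vals in product([0, 1], repeat=len(C)):
        x = list(v)
        for i, val in zip(C, vals):
            x[i] = val
        if classify(tuple(x)) != c:
            return True
    return False

def explains(E):
    # E is an explanation iff freeing its complement cannot flip the class.
    return not flips([i for i in range(m) if i not in E])

subsets = [set(s) for k in range(m + 1) for s in combinations(range(m), k)]
cxps = [C for C in subsets if flips(sorted(C))]          # all contrastive examples
min_expl = min((E for E in subsets if explains(E)), key=len)
mhs = min((H for H in subsets if all(H & C for C in cxps)), key=len)
```

For this classifier every size-2 subset is contrastive, the minimum explanation has size 2, and the minimum hitting set of all contrastive examples has the same size, as the duality predicts.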

## 3 Provable Approximations for Minimal Explanations

State-of-the-art approaches for finding minimum explanations exploit the MHS duality between explanations and contrastive examples [31]. The idea is to iteratively compute contrastive examples, and then use their MHS as an under-approximation for the minimum explanation. Finding this MHS is an NP-complete problem, and is difficult in practice as the number of contrastive examples increases [20]; and although the MHS can be approximated using maximum satisfiability (MaxSAT) or mixed integer linear programming (MILP) solvers [26, 47], existing approaches tackle simpler ML models, such as decision trees [33, 36], but face scalability limitations when applied to DNNs [31, 58]. Further, enumerating all contrastive examples may in itself take exponential time. Finally, recall that DNN verification is an NP-complete problem [39]; and so dispatching a verification query to identify each explanation or contrastive example is also very slow when the feature space is large. Finding minimal explanations may be easier [31], but may converge to larger and less meaningful explanations, while still requiring a linear number of calls to the underlying verifier. Our approach, described next, seeks to mitigate these difficulties.

Our overall approach is described in Algorithm 1. It is comprised of two separate threads, intended to be run in parallel. The upper bounding thread (TUB) is responsible for computing a minimal explanation. It starts with the entire feature space, and then gradually reduces it, until converging to a minimal explanation. The size of the presently smallest explanation is regarded as an upper bound (UB) for the size of the minimum explanation. Symmetrically, the lower bounding thread (TLB) attempts to construct small contrastive sets, used for computing a lower bound (LB) on the size of the minimum explanation. Together, these two bounds allow us to compute the approximation ratio between the minimal explanation that we have discovered and the minimum explanation. For instance, given a minimal explanation of size 7 and a lower bound of size 5, we can deduce that our explanation is at most UB/LB = 7/5 times larger than the minimum. The two threads share global variables that indicate the set of contrastive singletons (Singletons), the set of contrastive pairs (Pairs), the upper and lower bounds (UB, LB), and the set of features that were determined not to participate in the explanation and are "free" to be set to any value (Free). The output of our algorithm is a minimal explanation (F ∖ Free), and the approximation ratio (UB/LB). We next discuss each of the two threads in detail.


The Upper Bounding Thread (TUB). This thread, whose pseudocode appears in Algorithm 2, follows the framework proposed by Ignatiev et al. [31]: it seeks a minimal explanation by starting with the entire feature space, and then iteratively attempting to remove individual features. If removing a feature allows misclassification, we keep it as part of the explanation; otherwise, we remove it

and continue. This process issues a single verification query for each feature, until converging to a minimal explanation (lines 2–8). Although this naïve search is guaranteed to converge to a minimal explanation, it need not converge to a minimum explanation; and so we apply a more sophisticated ordering scheme, similar to the one proposed by [32], where we use some heuristic model as a way of assigning weights of importance to each input feature. We then check the "least important" input features first, since freeing them has a lower chance of causing a misclassification, and they are consequently more likely to be successfully removed. We then continue iterating over features in ascending order of importance, hopefully producing small explanations.
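The framework of Algorithm 2 can be sketched as follows, with an exhaustive check over a binary toy classifier standing in for the verifier (both the classifier and the importance order are assumptions for illustration):

```python
from itertools import product

# Toy stand-in for the verifier's setting (assumptions for illustration):
# three binary features, class 1 iff at least two are set.
classify = lambda x: int(sum(x) >= 2)
v, c, m = (1, 1, 1), 1, 3

def misclassifiable(E):
    """One 'verification query': SAT iff freeing the features outside E
    allows a classification other than c."""
    free = [i for i in range(m) if i not in E]
    for vals in product([0, 1], repeat=len(free)):
        x = list(v)
        for i, val in zip(free, vals):
            x[i] = val
        if classify(tuple(x)) != c:
            return True
    return False

def minimal_explanation(order):
    """Traverse features from least to most important, freeing each feature
    whose removal keeps the classification invariant (an UNSAT query)."""
    E = set(range(m))
    for f in order:                    # ascending importance
        if not misclassifiable(E - {f}):
            E.remove(f)                # UNSAT: f is not needed
    return E

expl = minimal_explanation([2, 0, 1])  # hypothetical importance order
```

The result is a minimal explanation: removing any remaining feature would make misclassification possible.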


The Lower Bounding Thread (TLB). The pseudocode for the lower bounding thread (TLB) appears in Algorithm 3. In lines 1–6, the thread searches for contrastive singletons. Neural networks were shown to be very sensitive to adversarial attacks [25] — slight input perturbations that cause misclassification (e.g., the aforementioned one-pixel attack [64]) — and this suggests that contrastive sets, and in particular contrastive singletons, exist in many cases. We observe that identifying contrastive singletons is computationally cheap: by encoding Eq. 3 as a verification query, once for each feature, we can discover all singletons; and in these queries all features but one are fixed, which empirically allows verifiers to dispatch them quickly.

The rest of TLB (lines 9–13) performs a similar process, but with contrastive pairs (which do not contain contrastive singletons as one of their features). We use verification queries to identify all such pairs, and then attempt to find their MHS. We observe that finding the MHS of all contrastive pairs is the 2-MHS problem, which is a reformulation of the minimum vertex cover problem (see Sec. B of the appendix of the full version of this paper [11]). Since this is an easier problem than the general MHS problem, solving it with MaxSAT or MILP often converges quickly. In addition, minimum vertex cover admits a linear-time greedy 2-approximation algorithm, which can be used for finding a lower bound in cases of large feature spaces.
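The greedy 2-approximation works by repeatedly picking an uncovered edge and adding both of its endpoints; the picked edges form a maximal matching, whose size lower-bounds the minimum vertex cover. A minimal sketch (the pair graph below is hypothetical):

```python
def greedy_vertex_cover(edges):
    """Pick an uncovered edge, add both endpoints; repeat. The picked edges
    form a maximal matching M, so |M| <= MVC <= |cover| = 2|M|."""
    cover, matching = set(), []
    for (u, w) in edges:
        if u not in cover and w not in cover:
            matching.append((u, w))
            cover |= {u, w}
    return cover, matching

# Hypothetical contrastive-pair graph:
pairs = [(1, 2), (3, 4), (1, 3)]
cover, matching = greedy_vertex_cover(pairs)
mvc_lower_bound = len(matching)  # usable as a sound lower bound in Eq. 4
```

Note the asymmetry: the cover itself over-approximates the MVC (within a factor of 2), while the matching size under-approximates it, which is the direction needed for LB.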

More formally, TLB performs an efficient computation of the following bound:

$$\text{LB} = \left|\text{Singletons}\right| + \left|\text{MVC}(\text{Pairs})\right| \le \left|\text{MHS}(\text{Cxps})\right| = E_M \tag{4}$$


Algorithm 3 TLB: Lower Bounding Thread

where MVC is the minimum vertex cover, Cxps denotes the set of all contrastive examples, and E<sup>M</sup> is the size of the minimum explanation.

It is worth mentioning that this approach can be extended to use contrastive examples of larger sizes (k = 3, 4, ...), as specified in Sec. C of the appendix of the full version of this paper [11]. The fact that small contrastive examples, such as singletons, exist in large, state-of-the-art DNNs with large inputs [21, 64] suggests that useful approximations exist in large DNNs. In our experiments, we observed that using only singletons and pairs affords good approximations, without incurring overly expensive computations by the underlying verifier.
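Eq. 4 can be exercised end-to-end on a toy instance: enumerate singletons, enumerate pairs, and bound the MVC of the pair graph from below by a greedy maximal matching. The classifier is an assumption for illustration:

```python
from itertools import combinations, product

# Toy classifier (an assumption): class 1 iff x0 AND (x1 OR x2); v = (1,1,1).
classify = lambda x: int(x[0] and (x[1] or x[2]))
v, c, m = (1, 1, 1), 1, 3

def flips(C):
    """Eq. 3 by enumeration: can altering only the features in C flip the class?"""
    for vals in product([0, 1], repeat=len(C)):
        x = list(v)
        for i, val in zip(C, vals):
            x[i] = val
        if classify(tuple(x)) != c:
            return True
    return False

singletons = [i for i in range(m) if flips([i])]
pairs = [(i, j) for i, j in combinations(range(m), 2)
         if i not in singletons and j not in singletons and flips([i, j])]

# Bound MVC(Pairs) from below by a greedy maximal matching.
matching, covered = [], set()
for (i, j) in pairs:
    if i not in covered and j not in covered:
        matching.append((i, j))
        covered |= {i, j}

LB = len(singletons) + len(matching)  # Eq. 4, with the matching bound on MVC
```

For this classifier the bound is tight: LB = 2, and the minimum explanation (e.g., fixing x0 and x1) indeed has size 2.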

## 4 Finding Minimal Explanations Efficiently

Algorithm 1 is the backbone of our approach, but it suffers from limited scalability — particularly, in TUB. As the execution of TUB progresses, and as additional features are "freed", the quickly growing search space slows down the underlying verifier. Here we propose three different methods for expediting this process, by reducing the number of verification queries required.

Method 1: Using Information from TLB. We suggest leveraging the contrastive examples found by TLB to expedite TUB. The process is described in Algorithm 4. In line 3, TLB is queried for the current set of contrastive singletons, which we know must be part of any minimal explanation. These are subtracted from the RemainingFeatures set (features left for TUB to query), and consequently will not be added to the Free set — i.e., they are marked as part of the current explanation. In addition, for any contrastive pair (a, b) found by TLB, either a or b must appear in any minimal explanation; and so, our algorithm skips checking the case where both a and b are removed from F (line 8). (The method could also be extended to contrastive sets of greater cardinality.)


Method 2: Binary Search. Sorting the features being considered in ascending order of importance can have a significant effect on the size of the explanation found by Algorithm 2. Intuitively, a "perfect" heuristic model would assign the greatest weights to all features in the minimum explanation, and so traversing features in ascending order would first discover all the features that can be removed (UNSAT verification queries), followed by all the features that belong in the explanation (SAT queries). In this case, a sequential traversal of the features in ascending order is quite wasteful, and it is much better to perform a binary search to find the point where the answer flips from UNSAT to SAT.

Of course, in practice, the heuristic models are not perfect, leading to potential cases with multiple "flips" from SAT to UNSAT, and vice versa. Still, if the heuristic is good in practice (which is often the case; see Sec. 6), these flips are scarce. Thus, we propose to perform multiple binary searches, each time identifying one SAT query (i.e., a feature added to the explanation). Observe that each time we hit an UNSAT query, this indicates that all the queries for features with lower priorities would also yield UNSAT — because if "freeing" multiple features cannot change the classification, freeing fewer features certainly cannot. Thus, we are guaranteed to find the first SAT query in each iteration, and soundness is maintained. This process is described in Algorithm 6 and in Fig. 14 in the appendix of the full version of this paper [11].
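Each individual binary search can be sketched against an abstract SAT/UNSAT oracle; the monotonicity argued above (UNSAT at an index implies UNSAT at all lower indices) guarantees the leftmost SAT query is found. The oracle's answers here are hypothetical:

```python
def first_sat(is_sat, lo, hi):
    """Leftmost index in [lo, hi] whose query is SAT, assuming UNSAT at an
    index implies UNSAT at all lower indices. Returns hi + 1 if none is SAT."""
    while lo < hi:
        mid = (lo + hi) // 2
        if is_sat(mid):
            hi = mid       # a SAT answer may still lie further left
        else:
            lo = mid + 1   # everything at mid and below is UNSAT
    return lo if is_sat(lo) else lo + 1

# Hypothetical oracle: freeing the prefix up to index i is SAT from index 3 on.
answers = [False, False, False, True, True]
queries = []
def is_sat(i):
    queries.append(i)
    return answers[i]

idx = first_sat(is_sat, 0, len(answers) - 1)
```

In the ideal case each such search replaces a linear scan, so one SAT feature is located with logarithmically many queries.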

Method 3: Local-Singleton Search. Let N be a DNN, and let x be an input point whose classification we seek to explain. As part of Algorithm 2, TUB iteratively "frees" certain input features, allowing them to take arbitrary values, as it continues to search for features that must be included in the explanation. The increasing number of free features enlarges the search space that the underlying verifier must traverse, thus slowing down verification. We propose to leverage the hypothesis that input points nearby x that are misclassified tend to be clustered; and so, it is beneficial to fix the free features to "bad" values, as opposed to letting them take on arbitrary values. We speculate that this will allow the verifier to discover satisfying assignments much more quickly.

This enhancement is shown in Algorithm 5. Given a set Free of features that were previously freed, we fix their values according to some satisfying assignment previously discovered. Thus, the verification of any new feature that we consider is similar to the case of searching for contrastive singletons, which, as we already know, is fairly fast. See Fig. 15 in the appendix of the full version of this paper [11] for an illustration. The process can be improved further by fixing the freed features to small neighborhoods of the previously discovered satisfying assignment (instead of its exact values), to allow some flexibility while still keeping the query's search space small.
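The effect of pinning freed features can be sketched as follows: once the freed features are fixed to a previously discovered "bad" assignment, each new check leaves only a single feature free, just like a contrastive-singleton query. The toy classifier and assignment below are assumptions for illustration:

```python
# Toy setting (assumptions): majority classifier over three binary features.
classify = lambda x: int(sum(x) >= 2)
v, c = (1, 1, 1), 1

def singleton_style_query(f, free, bad):
    """Check feature f with all previously freed features pinned to a known
    'bad' assignment, so only f itself remains free (cf. Eq. 3)."""
    for val in (0, 1):
        x = list(v)
        for i in free:
            x[i] = bad[i]   # pin freed features to the satisfying assignment
        x[f] = val
        if classify(tuple(x)) != c:
            return True     # SAT found with a cheap single-feature query
    return False
```

For instance, with feature 2 already freed and pinned to the bad value 0, checking feature 1 immediately exposes a misclassifying input, without searching over both free features.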



## 5 Minimal Bundle Explanations

So far, we presented methods for generating explanations within a given approximation ratio of the minimum explanation (Sec. 3), and for expediting the computation of these explanations (Sec. 4) — in order to improve the scalability of our explanation generation mechanism. Next, we seek to tackle the second challenge from Sec. 1, namely that these explanations may be too low-level for many users. To address this challenge, we focus on bundles, a topic well covered in the ML [63] and heuristic XAI literature [50, 55] (commonly known as "super-pixels" for computer-vision tasks). Intuitively, bundles are a partitioning of the features into disjoint sets (an

Fig. 6: Partitioning an input's features into bundles.

illustration appears in Fig. 6). The idea, which we later validate empirically, is that providing explanations in terms of bundles is often easier for humans to comprehend. As an added bonus, using bundles also curtails the search space that the verifier must traverse, expediting the process even further.

Given a feature space F = {1, ..., m}, a bundle b is simply a subset b ⊆ F. When dealing with the set of all bundles B = {b<sub>1</sub>, b<sub>2</sub>, ..., b<sub>n</sub>}, we require that they form a partitioning of F, namely F = ⊎<sub>i</sub> b<sub>i</sub>. We define a bundle explanation E<sub>B</sub> for a classification instance (v, c) as a subset of bundles, E<sub>B</sub> ⊆ B, such that:

$$\forall x \in \mathbb{F}. \left[\bigwedge_{i \in \cup E_B} (x_i = v_i) \to (N(x) = c)\right] \tag{5}$$
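Over binary features, Eq. 5 can be checked by brute force: only the features belonging to the chosen bundles are fixed to their values in v. The classifier and bundle partition below are assumptions for illustration:

```python
from itertools import product

def is_bundle_explanation(classify, m, v, bundles, EB):
    """Eq. 5: fixing all features of the bundles in EB to their values in v
    must force v's class for every completion of the remaining features."""
    fixed = set().union(*[bundles[b] for b in EB])
    c = classify(v)
    for x in product([0, 1], repeat=m):
        if all(x[i] == v[i] for i in fixed) and classify(x) != c:
            return False
    return True

# Toy setting (assumptions): four binary features, class 1 iff at least two
# are set; bundles group adjacent features, as in a super-pixel segmentation.
classify = lambda x: int(sum(x) >= 2)
v = (1, 1, 1, 1)
bundles = {0: {0, 1}, 1: {2, 3}}
```

Here the single bundle 0 (features 0 and 1) is already a bundle explanation, while the empty set of bundles is not; the search space shrinks because candidates are subsets of bundles rather than of individual features.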

The following theorem then connects bundle explanations and explicit, non-bundle explanations:

#### Theorem 1. The union of features in a bundle explanation is an explanation.

The proof directly follows from Eqs. 1 and 5. We note that this definition of bundles implies that features that are not part of the bundle explanation (i.e. features contained in "free" bundles) are "free" to be set to any possible value. Another possible alternative for defining bundles could be to allow features in "free" bundles to only change in the same, coordinated manner. We focus here on the former definition, and leave the alternative definition for future work.

Many of the aforementioned results and definitions for explanations can be extended to bundle explanations. In a similar manner to Eq. 5, we can define the notions of minimal and minimum bundle explanations, a contrastive bundle singleton, and contrastive bundle pairs (see Sec. D of the appendix of the full version of this paper [11]). Lemmas 1 and 2 can be extended to bundle explanations in a straightforward manner. It then follows that all bundle explanations contain all contrastive singleton bundles, and that all bundle explanations contain at least one bundle of any contrastive bundle pair.

Our method from Secs. 3 and 4 can be similarly performed on bundles rather than on features, and TUB would then be used for calculating a minimal bundle explanation, rather than a minimal explanation. Regarding the aforementioned approximation ratio, we discuss and evaluate two different methods for obtaining it. The first, natural approach is to apply our techniques from Sec. 3 on bundle explanations, thus obtaining a provable approximation for a minimum bundle explanation. The upper bound is trivially derived from the size of the bundle explanation found by TUB, whereas the lower bound calculation requires assigning a cost to each bundle, representing the number of features it contains. This is done via the known notion of minimum hitting sets of bundles (MHSB) [6] and using minimum weighted vertex cover for the approximation of contrastive bundle pairs. This method, which is almost identical to the one mentioned in Sec. 3, is formalized in Sec. D of the appendix of the full version of this paper [11].

The second approach is to calculate an approximation ratio with respect to a regular, non-bundle minimum explanation. The minimal bundle explanation found by TUB is an upper bound on the minimum non-bundle explanation, by Theorem 1. For computing a lower bound, we can analyze contrastive bundle examples; extract from them contrastive non-bundle examples; and then, using the duality property, compute an MHS of these contrastive examples and derive lower bounds for the size of the minimum explanation. We formalize techniques for performing this calculation in Sec. E of the appendix of the full version of this paper [11].

## 6 Evaluation

Implementation and Setup. For evaluation purposes, we created a proof-of-concept implementation of our approach as a Python framework. Currently, the framework uses the Marabou verification engine [41] as a backend, although other engines may be used. Marabou is a Simplex-based DNN verification framework that is sound and complete [5, 39–41, 68, 69], and which includes support for proof production [35], abstraction [15, 16, 52, 60, 67, 72], and optimization [62]; it has been used in various settings, such as ensemble selection [3], simplification [22, 43], repair [23, 53], and verification of reinforcement-learning based systems [2, 4, 17]. For sorting features by their relevance, we used the popular XAI method LIME [55]; although again, other heuristics could be used. The MVC was calculated using the classic greedy 2-approximation algorithm. All experiments reported were conducted on x86-64 Gnu/Linux-based machines, using a single Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz core, with a 1-hour timeout.

Benchmarks. As benchmarks, we used DNNs trained over the MNIST dataset for handwritten digit recognition [45]. These networks classify 28 × 28 grayscale images into the digits 0, ..., 9. Additionally, we used DNNs trained over the Fashion-MNIST dataset [71], which classify 28 × 28 grayscale images into 10 clothing categories ("Dress", "Coat", etc.). For each of these datasets we trained a DNN with the following architecture: (i) an input layer (which corresponds to the image) of size 784; (ii) a fully connected hidden layer with 30 neurons; (iii) another fully connected hidden layer, with 10 neurons; and (iv) a final, softmax layer with 10 neurons, corresponding to the 10 possible output classes. The accuracy of the MNIST DNN was 96.6%, whereas that of the Fashion-MNIST DNN was 87.6%. (We note that we configured LIME to ignore the external border pixels of each input, as these are not part of the actual image.)

In selecting the classification instances to be explained for these networks, we targeted input points where the network was not confident — i.e., where the winning label did not win by a large margin. The motivation for this choice is that explanations are most useful and relevant in cases where the network's decision is unclear, which is reflected in lower confidence scores. Additionally, explanations of instances with lower confidence tend to be larger, facilitating the process of extensive experimentation. We thus selected the 100 inputs from the MNIST and the Fashion-MNIST datasets where the networks demonstrated the lowest confidence scores — i.e., where the difference between the winning output score and the runner-up class score was minimal.

Experiments. Our first goal was to compare our approach to that of Ignatiev et al. [31], which is the current state of the art in verification-based explainability of DNNs. Other approaches consider other ML model types, such as decision trees [33, 36], or focus on alternative definitions for abductive explanations [42, 70], and are thus not comparable. Because the implementation used in [31] is unavailable, we implemented their approach, using Marabou as the underlying verifier for a fair comparison. In addition, we used the same heuristic model, LIME, for sorting

(a) Portion of features verified to participate in the explanation.

Fig. 7: Our full and ablation-based results, compared to the state of the art for finding minimal explanations on the MNIST dataset.

the input features' relevance. Fig. 7 depicts a comparison of the two approaches, over the MNIST benchmarks. The Fashion-MNIST results were similar, but since the Fashion-MNIST network had lower accuracy it tended to produce larger explanations with lower run-times, resulting in less meaningful evaluations (due to space limitations, these results appear in Fig. 12 in the appendix of the full version of this paper [11]). We compared the approaches according to two criteria: the portion of input features whose participation in the explanation was verified, over time (part (a) of Fig. 7), and the average size of the presently obtained explanation over time, also presented as a fraction of the total number of input features (part (b)). The results indicate that our method significantly improves over the state of the art, verifying the participation of 40.4% additional features, on average, and producing explanations that are 9.7% smaller, on average, at the end of the 1-hour time limit. Furthermore, our method timed out on 10% fewer benchmarks. We regard this as compelling evidence of the potential of our approach to produce more efficient verification-based XAI.

We also looked into comparing our approach to heuristic, non-verification-based approaches, such as LIME itself; but these comparisons did not prove to be meaningful, as the heuristic approaches typically solved benchmarks very quickly, but very often produced incorrect explanations. This matches the findings reported in previous work [14, 32].

Next, we set out to evaluate the contribution of each of the components implemented within our framework to overall performance, using an ablation study. Specifically, we ran our framework with each of the components mentioned in Sec. 4, i.e. (i) information exchange between TUB and TLB; (ii) the binary search in TUB; and (iii) the local-singleton search, turned off. The results on the MNIST benchmarks appear in Fig. 7; see Fig. 12 in the appendix of the full version of this paper [11] for the Fashion-MNIST results. Our experiments revealed that each of the methods mentioned in Sec. 4 had a favorable impact on both the average portion of features verified, and the average size of the discovered explanation, over time. Fig. 7a indicates that the local-singleton search method, used for efficiently proving that features are bound to be included in the explanation, was the most significant in reducing the number of features remaining to be verified, thus substantially increasing the portion of verified features. Moreover, Fig. 7b indicates that the binary search method, which is used for grouping UNSAT queries and proving the exclusion of features from the explanation, was the most significant for more efficiently obtaining smaller explanations over time.

Our second goal was to evaluate the quality of the minimum explanation approximation of our method (using the lower/upper bounds) over time. Results are averaged over all benchmarks of the MNIST dataset and are presented in Fig. 8 (similar results on Fashion-MNIST appear in Fig. 13 in the appendix of the full version of this paper [11]). The upper bound represents the average size of the explanation discovered by TUB over time, whereas the lower bound represents the average lower bound discovered by TLB over time. It can be seen that initially, there is a steep increase in

Fig. 8: Average approximation of minimum explanation over time.

the size of the lower bound, as TLB discovered many contrastive singletons. Later, as we begin iterating over contrastive pairs, the verification queries take longer to solve, and progress becomes slower. The average approximation ratio achieved after an hour was 1.61 for MNIST and 1.19 for Fashion-MNIST.

For our third experiment, we set out to assess the improvements afforded by bundles. We repeated the aforementioned experiments, this time using sets of features representing bundles instead of the features themselves. The segmentation into bundles was performed using the quickshift method [65], with LIME again used for assigning relevance to each bundle [55]. We approximate the sizes of the bundle explanations in terms of both the minimum bundle explanation and the minimum (non-bundle) explanation (as mentioned in Sec. 5 and in Sec. E of the appendix of the full version of this paper [11]). The bundle configuration showed drastic efficiency improvements, with none of the experiments timing out within the 1-hour time limit, thus reducing the portion of timeouts on the MNIST dataset by 84%. The efficiency improvement was obtained at the expense of explanation size, resulting in an increase of 352% in the approximation ratios obtained for MNIST and 39% for Fashion-MNIST. Nevertheless, when calculating the approximation in terms of the minimum bundle explanation, an increase of only 12% and 8% was obtained for MNIST and Fashion-MNIST, respectively (results are summarized in Table 1 in the appendix of the full version of this paper [11]). For a visual evaluation, we performed the same set of experiments for both bundle and non-bundle implementations, using instances with high confidence rates to obtain smaller explanations that could be more easily interpreted. A

Fig. 9: Minimal explanations and bundle explanations found by our method on the Fashion-MNIST dataset. White pixels are not part of the explanation.

sample of these results is presented in Fig. 9. Empirically, we observe that the bundle-produced explanations are less complex and more comprehensible.

Overall, we regard our results as compelling evidence that verification-based XAI can soundly produce meaningful explanations, and that our improvements can indeed significantly improve its runtime.

## 7 Related Work

Our work is another step in the ongoing quest for formal explainability of DNNs, using verification [19, 27, 31, 58]. Related approaches have applied enumeration of contrastive examples [30, 31], which is also an ingredient of our approach. Other approaches focus on producing abductive explanations around an epsilon environment [42, 70]. Similar work has been carried out for decision sets [33], lists [28] and trees [36], where the problem appears to be simpler to solve [36]. Our work here tackles DNNs, which are known to be more difficult to verify [39].

Prior work has also sought to produce approximate explanations, e.g., by using δ-relevant sets [37,66]. This line of work has focused on probabilistic methods for generating explanations, which jeopardizes soundness. There has also been extensive work in heuristic XAI [50, 55, 56, 61], but here, too, the produced explanations are not guaranteed to be correct.

## 8 Conclusion

Although DNNs are becoming crucial components of safety-critical systems, they remain "black-boxes", and cannot be interpreted by humans. Our work seeks to mitigate this concern, by providing formally correct explanations for the choices that a DNN makes. Since discovering the minimum explanations is difficult, we focus on approximate explanations, and suggest multiple techniques for expediting our approach — thus significantly improving over the current state of the art. In addition, we propose to use bundles to efficiently produce more meaningful explanations. Moving forward, we plan to leverage lightweight DNN verification techniques for improving the scalability of our approach [49], as well as extend it to support additional DNN architectures.


## OccRob: Efficient SMT-Based Occlusion Robustness Verification of Deep Neural Networks

Xingwu Guo<sup>1</sup>, Ziwei Zhou<sup>1</sup>, Yueling Zhang<sup>1</sup>, Guy Katz<sup>2</sup>, Min Zhang<sup>1</sup>

> <sup>1</sup> Shanghai Key Laboratory of Trustworthy Computing, East China Normal University, Shanghai, China zhangmin@sei.ecnu.edu.cn <sup>2</sup> The Hebrew University of Jerusalem, Jerusalem, Israel

Abstract. Occlusion is a prevalent and easily realizable semantic perturbation to deep neural networks (DNNs). It can fool a DNN into misclassifying an input image by occluding some segments, possibly resulting in severe errors. Therefore, DNNs planted in safety-critical systems should be verified to be robust against occlusions prior to deployment. However, most existing robustness verification approaches for DNNs are focused on non-semantic perturbations and are not suited to the occlusion case. In this paper, we propose the first efficient, SMT-based approach for formally verifying the occlusion robustness of DNNs. We formulate the occlusion robustness verification problem and prove it is NP-complete. Then, we devise a novel approach for encoding occlusions as a part of neural networks and introduce two acceleration techniques so that the extended neural networks can be efficiently verified using off-the-shelf, SMT-based neural network verification tools. We implement our approach in a prototype called OccRob and extensively evaluate its performance on benchmark datasets with various occlusion variants. The experimental results demonstrate our approach's effectiveness and efficiency in verifying DNNs' robustness against various occlusions, and its ability to generate counterexamples when these DNNs are not robust.

## 1 Introduction

Deep neural networks (DNNs) are computer-trained *programs* that can implement hard-to-formally-specify tasks. They have repeatedly demonstrated their potential in enabling artificial intelligence in various domains, such as face recognition [6] and autonomous driving [27]. They are increasingly being incorporated into safety-critical applications with interactive environments. To ensure the security and reliability of these applications, DNNs must be highly dependable against adversarial and environmental perturbations. This dependability property is known as *robustness* and is attracting a considerable amount of research effort from both academia and industry, aimed at ensuring robustness via different technologies such as adversarial training [13,28], testing [40,33], and formal verification [34,10,5].

Occlusion is a prevalent kind of perturbation, which may cause DNNs to misclassify an image by occluding some segment thereof [38,25,8]. For instance, a "turn left" traffic sign may be misclassified as "go straight" after it is occluded by a tape, possibly resulting in traffic accidents. A similar situation may occur in face recognition, where many well-trained neural networks fail to recognize faces correctly when they are partially occluded, such as when glasses are worn [37]. A neural network is called *robust against occlusions*

if small occlusions do not alter its classification results. Generally, we want a DNN to be robust against occlusions that appear negligible to humans.

It is challenging to verify whether a DNN is robust on an input image when the image is occluded. On the one hand, the verification problem is non-convex due to the non-linear activation functions in DNNs. It is NP-complete even when dealing with common, fully connected feed-forward neural networks (FNNs) [20]. On the other hand, unlike existing perturbations, occlusions are challenging to encode using *L<sub>p</sub>* norms. Most existing robustness verification approaches assume that perturbations are defined by *L<sub>p</sub>* norms and then apply approximation and abstract interpretation techniques [34,10,5] as part of the verification process. The semantic effect of an occlusion partially alters the values of some neighboring pixels from large to small or in the inverse direction, e.g., from 255 to 0 when a black occlusion covers a white pixel. Therefore, existing techniques for perturbations in *L<sub>p</sub>* norms are not suited to occlusion perturbations.

SMT-based approaches have been shown to be efficient for DNN verification [20]. They are both sound and complete, in that they always return definite results and produce counterexamples in non-robust cases. We show that, although it is straightforward to encode the occlusion robustness verification problem into SMT formulas, solving the constraints generated by this naïve encoding is experimentally beyond the reach of state-of-the-art SMT solvers, due to the large number of piece-wise linear ReLU activation functions involved. Consequently, such a straightforward encoding approach cannot scale to large networks.

In this paper, we systematically study the occlusion robustness verification problem of DNNs. We first formalize the problem and prove that it is NP-complete for ReLU-based FNNs. Then, we propose a novel approach for encoding various occlusions together with the neural networks, generating new, equivalent networks that can be efficiently verified using off-the-shelf SMT-based robustness verification tools such as Marabou [21]. Although our encoding introduces additional neurons and layers for the occlusions, their number is reasonably small and independent of the networks to be verified. The efficiency improvement of our approach comes from the fact that it significantly reduces the number of constraints introduced while encoding the occlusion, and leverages the backend verification tool's optimizations for the neural network structure. Furthermore, we introduce two acceleration techniques: input-space splitting, which reduces the search space of a single verification run and can significantly improve verification efficiency, and label sorting, which helps verification terminate earlier. We implement a tool called OccRob with Marabou as the backend verification tool. To our knowledge, this is the first work on formally verifying the occlusion robustness of deep neural networks.

To demonstrate the effectiveness and efficiency of OccRob, we evaluate it on six representative FNNs trained on two benchmark datasets. The empirical results show that our approach is effective and efficient in verifying robustness against various types of occlusions with respect to the occlusion position, size, and occluding pixel value.

Contributions. We make the following three major contributions: (i) we propose a novel approach for encoding occlusion perturbations, by which we can leverage *off-the-shelf* SMT-based robustness verification tools to verify the robustness of neural networks against various occlusion perturbations; (ii) we prove that the occlusion robustness verification problem is NP-complete and introduce two acceleration techniques, i.e., label sorting and input-space splitting, to further improve the efficiency of verification; and (iii) we implement a tool called OccRob and conduct extensive experiments on a collection of benchmarks to demonstrate its effectiveness and efficiency.

Paper Organization. Sec. 2 introduces preliminaries. Sec. 3 formulates the occlusion robustness verification problem and studies its complexity. Sec. 4 presents our encoding approach and acceleration techniques for the verification. Sec. 5 shows the experimental results. Sec. 6 discusses related work, and Sec. 7 concludes the paper.

We omit the complete proofs and experimental results due to the page limit. Please refer to the technical report [15] for more details.

## 2 Preliminaries

#### 2.1 Deep Neural Networks and Their Robustness

As shown in Fig. 1, a deep neural network consists of multiple layers. The neurons on the input layer take input values, which are computed and propagated through the hidden layers and then output by the output layer. The neurons on each layer are connected to those on the predecessor and successor layers. We only consider fully connected, feedforward networks (FNNs) [11].

Given a λ-layer neural network, let *W*(*i*) be the weight matrix between the (*i* − 1)-th

Fig. 1: A fully-connected feed-forward neural network (FNN).

and *i*-th layers, and b(*i*) be the biases of the corresponding neurons, where *i* = 1, 2,...,λ. The network implements a function *F* : R*<sup>u</sup>* → R*<sup>r</sup>* that is recursively defned by:

$$\begin{aligned} z^{(0)} &= \mathbf{x}, \\ z^{(i)} &= \sigma(W^{(i)} \cdot z^{(i-1)} + \mathbf{b}^{(i)}), \quad i = 1, \ldots, \lambda - 1, && \text{(Layer Function)} \\ F(\mathbf{x}) &= W^{(\lambda)} \cdot z^{(\lambda - 1)} + \mathbf{b}^{(\lambda)}. && \text{(Network Function)} \end{aligned}$$

where σ(·) is called an *activation function* and *z*(*i*) denotes the result of neurons at the *i*-th layer.

For example, Fig. 1 shows a 3-layer neural network with three input neurons and two output neurons, namely, λ = 3, *u* = 3 and *r* = 2.

For the sake of simplicity, we use Φ<sub>F</sub>(*x*) = arg max<sub>ℓ∈L</sub> *F*(*x*)<sub>ℓ</sub> to denote the label ℓ such that the probability *F*(*x*)<sub>ℓ</sub> of classifying *x* to ℓ is larger than those of the other labels, where *L* represents the set of labels. The activation function σ can be the piece-wise linear Rectified Linear Unit (ReLU), σ(*x*) = max(*x*, 0), or an S-shaped function such as Sigmoid σ(*x*) = 1/(1 + *e*<sup>−x</sup>), Tanh σ(*x*) = (*e*<sup>x</sup> − *e*<sup>−x</sup>)/(*e*<sup>x</sup> + *e*<sup>−x</sup>), or Arctan σ(*x*) = tan<sup>−1</sup>(*x*). In this work, we focus on networks that contain only ReLU activation functions, which are widely adopted in real-world applications.
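The layer function, network function, and classification rule Φ<sub>F</sub> above can be sketched in a few lines of plain Python; the helper names (`affine`, `forward`, `classify`) are ours, not taken from any tool's implementation:

```python
import math

def relu(x):    return max(x, 0.0)                 # σ(x) = max(x, 0)
def sigmoid(x): return 1.0 / (1.0 + math.exp(-x))  # σ(x) = 1 / (1 + e^{-x})
def tanh(x):    return math.tanh(x)                # σ(x) = (e^x - e^{-x}) / (e^x + e^{-x})
def arctan(x):  return math.atan(x)                # σ(x) = tan^{-1}(x)

def affine(W, b, z):
    """W·z + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, z)) + bi for row, bi in zip(W, b)]

def forward(weights, biases, x, sigma=relu):
    """F(x): σ is applied on the hidden layers; the output layer is affine only."""
    z = x
    for W, b in zip(weights[:-1], biases[:-1]):
        z = [sigma(v) for v in affine(W, b, z)]
    return affine(weights[-1], biases[-1], z)

def classify(weights, biases, x):
    """Φ_F(x) = argmax over label indices of F(x)."""
    out = forward(weights, biases, x)
    return max(range(len(out)), key=lambda l: out[l])
```

For instance, a toy 2-layer ReLU network with weights `[[[1, -1], [0, 1]], [[1, 0], [0, 1]]]` and zero biases maps the input `[1, 2]` to the outputs `[0.0, 2]`, and hence to label 1.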

Fig. 2: Two multiform and uniform occlusions to traffic signs causing misclassifications.

A neural network is called *robust* if small perturbations to its inputs do not alter the classification result [39]. Specifically, given a network *F*, an input *x*<sub>0</sub> and a set Ω of perturbed inputs of *x*<sub>0</sub>, *F* is called locally robust with respect to *x*<sub>0</sub> and Ω if *F* classifies all the perturbed inputs in Ω to the same label as it does *x*<sub>0</sub>.

Definition 1 (Local Robustness [17]). *A neural network F* : R<sup>u</sup> → R<sup>r</sup> *is called locally robust with respect to an input x*<sub>0</sub> *and a set* Ω *of perturbed inputs of x*<sub>0</sub> *if* ∀*x* ∈ Ω, Φ<sub>F</sub>(*x*) = Φ<sub>F</sub>(*x*<sub>0</sub>) *holds.*

Usually, the set Ω of perturbed inputs is defined by an *L<sub>p</sub>*-norm ball around *x*<sub>0</sub> with a radius of ϵ, i.e., B<sub>p</sub>(*x*<sub>0</sub>, ϵ) := {*x* | ‖*x* − *x*<sub>0</sub>‖<sub>p</sub> ≤ ϵ} [17,2].

#### 2.2 Occlusion Perturbation

In the context of image classification networks, an occlusion is a kind of perturbation that blocks the pixels in certain areas before the image is fed into the network. Existing studies showed that the classification accuracy of neural networks can be significantly decreased when the input objects are artificially occluded [23,44].

Occlusions can vary in shape, size, color, and position. The shapes can be square, rectangular, triangular, or irregular. The size is measured by the number of occluded pixels. The occlusion color specifies the colors that occluded pixels can take. The coloring of an occlusion can be either uniform, where all occluded pixels share the same color, or multiform, where the colors can vary in the range [−ϵ, ϵ], where ϵ specifies the threshold between an occluded pixel's value and its original value.

Prior studies [8,3] showed that both uniform and multiform occlusions can cause neural networks to misclassify. Fig. 2 shows two examples of multiform and uniform occlusions, respectively. The traffic sign for "70km/h speed limit" in Fig. 2(a) is misclassified as "30km/h" after adding a 5 × 5 multiform occlusion. Fig. 2(d) shows another sign, under different lighting conditions, where a 3 × 3 uniform occlusion (shown in Fig. 2(c)) causes the sign to be misclassified as "30km/h".

The occlusion position is another aspect of defining occlusions. An occlusion can be placed precisely on the pixels of an image, or between a pixel and its neighbors. Fig. 3 shows an example, where the dots represent image

Fig. 3: An example occlusion on a 5 × 5 image at real number position.

pixels and the circles are the occluding pixels that will substitute the occluded ones. We say that an occlusion pixel ϑ<sub>i′,j′</sub> at location (*i*′, *j*′) surrounds an image pixel *p*<sub>i,j</sub> at location (*i*, *j*) if and only if |*i* − *i*′| < 1 and |*j* − *j*′| < 1. Note that *i*′, *j*′ are real numbers, representing the location where the occlusion pixel is placed on the image. An image pixel can be occluded by the substitute occlusion pixels if the occlusion pixels surround it.

There are at most four surrounding occlusion pixels for each image pixel, as shown in Fig. 3. Let I<sub>p</sub> be the set of locations where the surrounding occlusion pixels of *p* are placed. After the occlusion, the value of pixel *p*<sub>i,j</sub> is altered to a new value, denoted *p*′<sub>i,j</sub>, which can be computed by interpolation [19,22], such as nearest-neighbour or bi-linear interpolation, based on the occlusion pixels in I<sub>p</sub>. In addition, we use an *L*<sub>1</sub>-distance-based method to calculate how much a pixel is occluded. Since the *L*<sub>1</sub>-distance of two adjacent pixels is 1, a surrounding occlusion pixel should not affect an image pixel if their *L*<sub>1</sub>-distance is greater than 1. The quantity max(0, (1 − |*i* − *i*′|) + (1 − |*j* − *j*′|) − 1) indicates how much an image pixel at (*i*, *j*) is occluded by an occlusion pixel at (*i*′, *j*′). For instance, an occlusion pixel at (*i*′, *j*′) = (0.9, 0.9) has no effect on the image pixel at (*i*, *j*) = (0, 0), since their *L*<sub>1</sub>-distance is larger than 1. Therefore, the occlusion factor *s*<sub>i,j</sub> for the pixel *p* at (*i*, *j*) can be calculated based on all surrounding occlusion pixels in I<sub>p</sub> as:

$$s_{i,j} = \max\left(0,\; \sum_{(i'_0, j') \in \mathbb{I}_p} \left(1 - |j - j'|\right) + \sum_{(i', j'_0) \in \mathbb{I}_p} \left(1 - |i - i'|\right) - 1\right) \tag{1}$$

where (*i*′<sub>0</sub>, *j*′<sub>0</sub>) is the first element of I<sub>p</sub>. Notably, *s* is 1 for a completely occluded pixel and 0 for a pixel that is not occluded; otherwise, *s* takes a value in (0, 1). Equation 1 also admits the special case where (*i*′, *j*′) are integers, in which *s* reduces to 0 or 1.
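As a sanity check on the formula, the per-occlusion-pixel overlap max(0, (1 − |i − i′|) + (1 − |j − j′|) − 1) can be implemented directly; summing the overlaps over I<sub>p</sub> and clipping the result to [0, 1] is our assumption about how multiple surrounding pixels combine:

```python
def occlusion_factor(pixel, occ_pixels):
    """Occlusion factor s in [0, 1] for the image pixel at integer position
    `pixel`, given the real-valued positions `occ_pixels` of its surrounding
    occlusion pixels (the set I_p)."""
    i, j = pixel
    s = 0.0
    for ip, jp in occ_pixels:
        # Overlap of one occlusion pixel; zero once the L1-distance exceeds 1.
        s += max(0.0, (1 - abs(i - ip)) + (1 - abs(j - jp)) - 1)
    return min(s, 1.0)

print(occlusion_factor((0, 0), [(0.9, 0.9)]))  # L1-distance 1.8 > 1 → 0.0
print(occlusion_factor((1, 2), [(1, 2)]))      # integer case: fully occluded → 1.0
```

An occlusion pixel halfway between two pixel centers, e.g. at (1.5, 2), yields a factor of 0.5 for both neighbors, matching the partial-occlusion intuition.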

## 3 The Occlusion Robustness Verifcation Problem

Let R<sup>m×n</sup> be the set of images whose height is *m* and width is *n*. We use N<sub>1,m</sub> (*resp.* N<sub>1,n</sub>) to denote the set of natural numbers ranging from 1 to *m* (*resp. n*). A coloring function ζ : R<sup>m×n</sup> × R × R → R maps each pixel of an image to its corresponding color value. Given an image *x* ∈ R<sup>m×n</sup>, ζ(*x*, *i*, *j*) defines the value with which to color the pixel of *x* at (*i*, *j*).

Definition 2 (Occlusion function). *Given a coloring function* ζ *and an occlusion* ϑ *of size w* × *h, the occlusion function is defined as the function* γ<sub>ζ,w×h</sub> : R<sup>m×n</sup> × R × R → R<sup>m×n</sup> *such that x*′ = γ<sub>ζ,w×h</sub>(*x*, *a*, *b*) *if for all i* ∈ N<sub>1,n</sub> *and j* ∈ N<sub>1,m</sub>*:*

$$x'_{i,j} = x_{i,j} - s_{i,j} \times \left(x_{i,j} - \zeta(x, i, j)\right), \tag{2}$$

$$\text{where } \zeta(x, i, j) = \frac{\sum_{(i', j') \in \mathbb{I}_{p_{i,j}}} \vartheta_{i', j'} \sqrt{(i - i')^2 + (j - j')^2}}{\sum_{(i', j') \in \mathbb{I}_{p_{i,j}}} \sqrt{(i - i')^2 + (j - j')^2}}. \tag{3}$$

*s*<sub>i,j</sub> in Equation 2 is the occlusion factor for the pixel at (*i*, *j*), as defined in Sec. 2.2. Note that when *i*′, *j*′ are integers, Equation 2 reduces to *x*′<sub>i,j</sub> = ϑ<sub>i′,j′</sub>, meaning that *x*<sub>i,j</sub> is completely occluded; the integer case is thus a special case of the real-number case. Likewise, when the pixel at (*i*, *j*) is not occluded, *s*<sub>i,j</sub> = 0, and Equation 2 reduces to *x*′<sub>i,j</sub> = *x*<sub>i,j</sub>.

Interpolation is handled by ζ, as shown in Equation 3, which gives the standard form of the color of the new pixel *x*′<sub>i,j</sub>. For a uniform occlusion, a single color value is specified for all pixels in the occluded area, so ζ in Equation 3 reduces to ζ(*x*, *i*, *j*) = µ for some µ ∈ [0, 1]. For a multiform occlusion, the coloring function is defined as ζ(*x*, *i*, *j*) = *x*<sub>i,j</sub> + ∆<sub>p</sub> with ∆<sub>p</sub> ∈ [−ϵ, ϵ], where ϵ ∈ R defines the threshold by which a pixel can be altered.
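The two families of coloring functions, and their use inside Equation 2, can be sketched as follows (a minimal sketch; the function names and the dictionary of per-pixel shifts are illustrative, not part of the formal definitions):

```python
def uniform_coloring(mu):
    # ζ(x, i, j) = μ for some μ in [0, 1]: every occluded pixel gets the same value.
    return lambda x, i, j: mu

def multiform_coloring(deltas, eps):
    # ζ(x, i, j) = x_{i,j} + Δ_p with Δ_p in [-ε, ε]: each pixel shifts independently.
    def zeta(x, i, j):
        d = deltas[(i, j)]
        assert -eps <= d <= eps, "per-pixel shift must respect the threshold ε"
        return x[i][j] + d
    return zeta

def occlude_pixel(x, i, j, s, zeta):
    # Equation 2: x'_{i,j} = x_{i,j} - s_{i,j} * (x_{i,j} - ζ(x, i, j)).
    return x[i][j] - s * (x[i][j] - zeta(x, i, j))
```

With `s = 1` (a fully occluded pixel) the result is exactly ζ(x, i, j); with `s = 0` the pixel is unchanged.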

Definition 3 (Local occlusion robustness). *Given a DNN F* : R<sup>m×n</sup> → R<sup>r</sup>*, an occlusion function* γ<sub>ζ,w×h</sub> : R<sup>m×n</sup> × R × R → R<sup>m×n</sup> *with respect to a coloring function* ζ *and occlusion size w* × *h, and an input image x, F is called locally occlusion robust on x with* γ<sub>ζ,w×h</sub> *if* Φ<sub>F</sub>(*x*) = Φ<sub>F</sub>(γ<sub>ζ,w×h</sub>(*x*, *a*, *b*)) *holds for all* 1 ≤ *a* ≤ *n and* 1 ≤ *b* ≤ *m.*

Intuitively, Definition 3 means that *F* is robust on *x* against the occlusions of γ<sub>ζ,w×h</sub> if, on any image obtained from *x* by the occlusion function γ<sub>ζ,w×h</sub>, *F* always returns the same classification result as on the original image *x*. Depending on the coloring function ζ, the definition applies to various occlusions concerning shapes, colors, sizes, and positions. The definition can also be extended to global occlusion robustness if *F* is robust on all images with respect to γ<sub>ζ,w×h</sub>.
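For occlusions restricted to integer positions and a fixed uniform color, Definition 3 can be checked by brute-force enumeration. The sketch below is our baseline, not the paper's tool; it is exactly this enumeration that the SMT machinery generalizes to real-valued positions and multiform colors:

```python
def classify(F, x):
    """Φ_F: index of the label with the largest score."""
    out = F(x)
    return max(range(len(out)), key=lambda l: out[l])

def occlude(x, a, b, w, h, mu):
    """Paint the w x h block whose top-left pixel is (a, b) with the value mu."""
    return [[mu if a <= i < a + w and b <= j < b + h else x[i][j]
             for j in range(len(x[0]))] for i in range(len(x))]

def is_occlusion_robust(F, x, w, h, mu):
    """Definition 3 over integer positions only: every occluded image must keep
    the original label.  Returns (True, None) or (False, witness_position)."""
    label = classify(F, x)
    for a in range(len(x)):
        for b in range(len(x[0])):
            if classify(F, occlude(x, a, b, w, h, mu)) != label:
                return False, (a, b)  # counterexample: a non-robust occlusion
    return True, None
```

The number of queries grows with m × n, and each query is a full forward pass; the SMT encoding of Sec. 4 replaces this enumeration with a single symbolic query.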

We prove that even for the case of uniform occlusion, a special case of the multiform one, the local occlusion robustness verification problem is NP-complete for ReLU-based neural networks.

## 4 SMT-Based Occlusion Robustness Verification

#### 4.1 A Naïve SMT Encoding Method

The verification problem of an FNN's local occlusion robustness can be straightforwardly encoded as an SMT problem. Following Definition 3, we assume that *x* is classified by Φ<sub>F</sub> to the label ℓ<sub>q</sub>, i.e., Φ<sub>F</sub>(*x*) = ℓ<sub>q</sub>, for a label ℓ<sub>q</sub> ∈ *L*. To prove that *F* is robust on *x* after *x* is occluded by an occlusion ϑ of size *w* × *h*, it suffices to prove that *F* classifies every occluded image *x*′ = γ<sub>ζ,w×h</sub>(*x*, *a*, *b*) to ℓ<sub>q</sub> for all 1 ≤ *a* ≤ *n* and 1 ≤ *b* ≤ *m*. This is equivalent to proving that the following constraints are unsatisfiable:

$$1 \le a \le n, \quad 1 \le b \le m, \tag{4}$$

$$\bigwedge_{i \in \mathbb{N}_{1,n},\, j \in \mathbb{N}_{1,m}} \Big( \big( (a-1 < i < a+w+1) \wedge (b-1 < j < b+h+1) \wedge x'_{i,j} = \gamma_{\zeta, w \times h}(x, a, b)_{i,j} \big) \vee \big( ((i \ge a+w+1) \vee (i \le a-1) \vee (j \ge b+h+1) \vee (j \le b-1)) \wedge x'_{i,j} = x_{i,j} \big) \Big), \tag{5}$$

$$\bigvee_{l \in \mathbb{N}_{1,q-1} \cup \mathbb{N}_{q+1,r}} F(x')_l \ge F(x')_q. \tag{6}$$

The conjuncts in Eq. 5 define that *x*′ is an occluded instance of *x*, and the disjuncts in Eq. 6 indicate that, when satisfiable, there exists some label ℓ<sub>l</sub> which has a higher probability than ℓ<sub>q</sub> of being classified to. Namely, the occlusion robustness of *F* on *x* is falsified, with *x*′ being a witness of the non-robustness. Note that this naive encoding

Fig. 4: The workflow of encoding and verifying an FNN's robustness against occlusions.

considers the occlusion position's real number cases since function γ implicitly includes the interpolation.

Although the above encoding is straightforward, solving the encoded constraints is experimentally beyond the reach of existing general-purpose SMT solvers, due to the piece-wise linear ReLU activation functions in the definition of *F* in the constraints of Eq. 6, and the large search space *m* × *n* × (2ϵ)<sup>w×h</sup> (see Experiment II in Sec. 5).

## 4.2 Our Encoding Approach

An Overview of the Approach. To improve efficiency, we propose a novel approach that encodes occlusion perturbations into four layers of neurons and concatenates the original network to these so-called *occlusion layers*, constituting a new neural network which can be efficiently verified using state-of-the-art, SMT-based verifiers.

Fig. 4 shows an overview of our approach. Given an input image and an occlusion, we first construct a 3-hidden-layer occlusion neural network (ONN) and then concatenate it to the original FNN by connecting the ONN's output layer to the FNN's input layer. The combined network represents all possible occluded inputs and their classification results. The robustness of the constructed network can be verified using existing SMT-based neural network verifiers.

We introduce two acceleration techniques to speed up the verification further. First, we divide the occlusion space into several smaller, orthogonal spaces, and verify a finite set of sub-problems on the smaller spaces. Second, we employ the eager-falsification technique [14] to sort the labels according to their probabilities of being misclassified to; a label with a larger probability is verified earlier by the backend tools. Whenever a counterexample is returned, an occluded image has been found whose classification result differs from the original one. If all sub-problems are verified and no counterexamples are found, the network is proved robust on the input image against the provided occlusion.
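The label-sorting idea can be sketched in isolation; here `verify_against` stands in for a backend query (e.g., asking the solver whether some occluded input can make label ℓ outscore the predicted label) and is an assumption of this sketch, not an actual Marabou API:

```python
def sorted_adversarial_labels(scores, predicted):
    """Candidate misclassification labels, most probable first."""
    labels = [l for l in range(len(scores)) if l != predicted]
    return sorted(labels, key=lambda l: scores[l], reverse=True)

def verify_with_label_sorting(scores, predicted, verify_against):
    """Query the backend label by label; a counterexample ends the search early."""
    for l in sorted_adversarial_labels(scores, predicted):
        cex = verify_against(l)  # returns a counterexample or None
        if cex is not None:
            return cex           # not robust: terminate immediately
    return None                  # all labels exhausted: robust
```

If the most confusable label indeed yields a counterexample, only one backend call is made instead of |L| − 1, which is where the speed-up comes from.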

Encoding Occlusions as Neural Networks. Given a coloring function ζ, an occlusion size *w* × *h* and an input image *x* of size *m* × *n*, we construct a neural network *O* : R<sup>4+ct</sup> → R<sup>m×n</sup> to encode all the possible occluded images of *x*, where *c* = 1 if *x* is a grey-scale image and *c* = 3 if *x* is an RGB image, and *t* = 0 for a uniform occlusion and *t* = *w* × *h* for a multiform one.

Fig. 5 shows the neural network architecture for encoding occlusions. We divide it into a fundamental part and an additional part. The former encodes the occlusion position and the uniform occlusion color. The additional part is needed only by the multiform occlusion, to encode the coloring function. Without loss of generality, we assume that the input layer takes the vector (*a*, *w*, *b*, *h*, ζ), where (*a*, *b*) is the top-left coordinate of the occlusion area in *x*. The coloring function ζ is admitted by the other *c* × *t* neurons in the input layer when the occlusion is multiform.

*(1) Encoding occlusion positions.* We explain the weights and biases that are defined in the neural network to encode the occlusion position. On the connections between the input layer and the first hidden layer, the weights in the matrices *W*<sub>1,1</sub>, *W*<sub>1,2</sub> and *W*<sub>1,3</sub> are 1, -1 and -1, respectively. Note that, for clarity, we hide all the edges whose weights are 0 in the figure. The biases in b<sub>1,1</sub> are (−1, −2, ..., −*m*) for the first *m* neurons on the first hidden layer. Those in b<sub>1,2</sub> are (2, 3, ..., *m* + 1). The weights in *W*<sub>1,4</sub>, *W*<sub>1,5</sub>, *W*<sub>1,6</sub> and the biases in b<sub>1,3</sub> and b<sub>1,4</sub> are defined in the same way. We omit the details due to the page limitation.

For the second layer, the diagonals of the weight matrices *W*<sub>2,1</sub> to *W*<sub>2,4</sub> are set to -1, and the rest of their entries are 0. The biases in b<sub>2,1</sub> and b<sub>2,2</sub> are 1. After the prop-

Fig. 5: An occlusion neural network for the occlusions on an image *x* with ζ and *w* × *h*.

agation to the second hidden layer, a pixel at position (*i*, *j*) in the image *x* is occluded if and only if both the output of the *i*-th neuron among the first *m* neurons and that of the *j*-th neuron among the remaining *n* neurons on the second hidden layer are 1.

The third hidden layer represents the occlusion status of each pixel in the original image *x*. 2*n* weight matrices connect the second layer and the *n* × *m* neurons of the third layer. For example, consider the weights in *W*<sub>3,i</sub> and *W*<sub>3,n+i</sub>, which connect the *i*-th group of *m* neurons in the third layer to the second layer. The size of *W*<sub>3,i</sub> is *m* × *m*, and the weights in its *i*-th row are 1 while the rest are 0. The size of *W*<sub>3,n+i</sub> is *m* × *n*; the weights on its diagonal are set to 1, while the rest are set to 0. All the biases in b<sub>3,1</sub> to b<sub>3,n</sub> are -1. The output of the third layer indicates the occlusion status of all the pixels: if a pixel at (*i*, *j*) is occluded, then the output of the (*i* × *m* + *j*)-th neuron in the third layer is 1, and otherwise 0.

*(2) Encoding Coloring Functions.* We consider the uniform and multiform coloring functions separately for verification efficiency, although the former is a special case of the latter. We first consider the general multiform case, in which we introduce 2 × *n* × *m* extra neurons in the third hidden layer, as shown in the bottom part of Fig. 5. These neurons could be combined with the third layer, but it is clearer to keep them separate. The weight matrix *W*<sub>3,ζ</sub> connects the third layer to these neurons, with the first half of its diagonal set to 1 and the second half set to -1. This helps retain the sign of the input ζ during propagation. The weight matrix *W*<sub>ζ</sub> connects the input ζ to these neurons; its diagonal entries are 1, and the biases b<sub>ζ</sub> are -1. These neurons work just like the third layer, except that they not only represent the occlusion status of pixels but also preserve the input ζ. If a pixel at (*i*, *j*) is occluded and ζ has a positive value, then the (*i* × *m* + *j*)-th output in the first half of them is ζ. The (*i* × *m* + *j*)-th output in the second half is ζ when ζ has a negative value. Otherwise, the output is 0. The uniform case can be encoded together with the input images, and we thus explain it in the following paragraph.

*(3) Encoding Input Images.* In the fourth layer, we use *W*<sub>4</sub> to denote the weight matrix connecting to the third layer. *W*<sub>4</sub> is used to encode the pixel values of the input image *x* and the coloring function ζ of occlusions. In the uniform case, the weight w(*i*, *i*) in the diagonal of *W*<sub>4</sub> is w(*i*, *i*) = ζ<sub>i</sub> − *x*<sub>i</sub>, and the biases are b<sub>4</sub> = x, where x is the flattened vector of the original input image. In the multiform case, the weight matrix *W*<sub>4,ζ</sub> connects the neurons in the bottom part that preserve the information of the input ζ to the fourth layer. The first half of *W*<sub>4,ζ</sub> is identical to *W*<sub>4</sub>, and the second half of *W*<sub>4,ζ</sub> has its diagonal set to -1. It provides the value of the coloring function ζ, with either sign, for each occluded pixel. The output of the *j*-th neuron in the *i*-th group of the fourth layer is the raw pixel value plus ζ if the pixel at (*i*, *j*) is occluded; otherwise, it is the raw pixel value.

An Illustrative Example. We show an example of constructing the occlusion network for a 2 × 2, single-channel image in Fig. 6. In this example, we assume that the input image is *x* = [0.4, 0.6, 0.55, 0.72] and the occlusion applied to *x* has a size of 1 × 1, i.e., *w* = 1 and *h* = 1. For the uniform occlusion, the coloring function ζ

has a fxed value of 0, and for multiform case, the threshold that a pixel can be altered is 0.1.

We suppose the occlusion is applied at position (1, 2), i.e., $a = 1$ and $b = 2$ as the input of the occlusion network. In the forward propagation, we calculate the output of the first layer by $a \times W_{1,1} + b_{1,1}$ and $a \times W_{1,2} + b \times W_{1,3} + b_{1,2}$, and obtain (0, 0, 0, 1) for the first four neurons. Following the same process, we obtain (1, 0, 0, 0) for the second four neurons. After propagation to the second layer, it outputs (1, 0), (0, 1) based on $W_{2,1}$, $W_{2,2}$, and $b_2$, representing that the second column and the first row of $x$ are under occlusion. Likewise, the third layer outputs (0, 1, 0, 0) based on its weight matrices and biases, representing that the second pixel in the first row is occluded. After propagation to the fourth layer, the occlusion network outputs the occluded image $x' = [0.4, 0, 0.55, 0.72]$ based on $W_4$ and $b_4$. It is identical to the expected occluded image, where the second pixel is occluded and the other pixels stay unchanged. Suppose we change $a$ to some real number, for instance, 1.5. After the same propagation, the third layer outputs (0, 0.5, 0, 0.5), representing that the neurons in the second column are affected by the occlusion by a factor of 0.5. The fourth layer then outputs [0.4, 0.3, 0.55, 0.36], which is the corresponding occluded image $x'$.

Fig. 6: An example of encoding a one-pixel uniform occlusion as a neural network.
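The fourth-layer arithmetic of this example can be replayed in a few lines. This is a minimal sketch of the uniform case, not the tool's code: each output pixel is $z \cdot (\mu - x) + x$, where $z$ is the pixel's occlusion indicator from the third layer and $\mu$ is the uniform occluding color.

```python
def fourth_layer(z3, x, mu):
    # Uniform case: out_i = z_i * (mu - x_i) + x_i, i.e. the diagonal of W4
    # is (mu - x_i), applied to the occlusion indicators z3, with bias x.
    return [z * (mu - xi) + xi for z, xi in zip(z3, x)]

x = [0.4, 0.6, 0.55, 0.72]
# Integer position: only the second pixel is occluded (color 0).
assert fourth_layer([0, 1, 0, 0], x, 0.0) == [0.4, 0.0, 0.55, 0.72]
# Real position a = 1.5: the second column is affected by a factor of 0.5.
out = fourth_layer([0, 0.5, 0, 0.5], x, 0.0)
assert all(abs(o - e) < 1e-9 for o, e in zip(out, [0.4, 0.3, 0.55, 0.36]))
```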

In the multiform case, as mentioned above, we suppose the threshold $\epsilon = 0.1$ and keep all other settings. After the same propagation, the third layer outputs (0, 1, 0, 0), representing that the second pixel is occluded. The extra neurons then output (0, 0.1, 0, 0, 0, 0, 0, 0), where the second neuron in the first half is 0.1 and the rest are 0. This indicates both that the second pixel in the first row is occluded and that it has an epsilon of 0.1. After propagation to the fourth layer, the occlusion network outputs $x' = [0.4, 0.7, 0.55, 0.72]$ based on its $W_4$ and $b_4$. As expected, the second pixel is occluded and increased by 0.1, and the other pixels stay unchanged. For a negative $\epsilon$ of $-0.1$, the extra neurons output (0, 0, 0, 0, 0, 0.1, 0, 0). Note that the second neuron in the second half is 0.1 and the rest are 0, which retains the sign of $-0.1$. The fourth layer then outputs [0.4, 0.5, 0.55, 0.72], the expected occluded image where the second pixel is decreased by 0.1.

#### 4.3 The Correctness of the Encoding

Given an input image $x$, a rectangle occlusion of size $w \times h$, and a coloring function $\zeta$, let $O$ be the corresponding occlusion neural network constructed by the approach above, and let $F$ be the FNN to verify. We concatenate $O$ to $F$ by connecting $O$'s output layer to $F$'s input layer. The combined network implements the composed function $F \circ O$. The problem of verifying the occlusion robustness of $F$ on the input image $x$ is thus reduced to a regular robustness verification problem of $F \circ O$.

Theorem 1 (Correctness). *An FNN $F$ is robust on the input image $x$ with respect to a rectangle occlusion of size $w \times h$ and a coloring function $\zeta$ if and only if $\Phi_{F \circ O}((a, w, b, h, \zeta)) = \Phi_F(x)$ for all $1 \le a \le n$ and $1 \le b \le m$.*

Theorem 1 means that all the occluded images derived from $x$ are classified by $F$ to the same label as $x$, which implies the correctness of our proposed encoding approach. To prove Theorem 1, it suffices to show that the encoded occlusion neural network represents all the possible occluded images. In other words, when regarded as a function, the network outputs the same occluded image as the occlusion function for the same occlusion coordinate $(a, b)$, as formalized in the following lemma.

Lemma 1. *Given an occlusion function $\gamma_{\zeta, w \times h} : \mathbb{R}^{m \times n} \times \mathbb{R} \times \mathbb{R} \to \mathbb{R}^{m \times n}$ and an input image $x$, let $O_{\gamma,x} : \mathbb{R}^{4+ct} \to \mathbb{R}^{m \times n}$ be the corresponding occlusion neural network. Then $\gamma_{\zeta, w \times h}(x, a, b) = O_{\gamma,x}(a, w, b, h, \zeta)$ for all $1 \le a \le n$ and $1 \le b \le m$.*

*Proof (Sketch).* It suffices to prove $\gamma_{\zeta, w \times h}(x, a, b)_{i,j} = O_{\gamma,x}(a, w, b, h, \zeta)_{i,j}$ for all $i \in \mathbb{N}_{1,n}$ and $j \in \mathbb{N}_{1,m}$. By Definition 2, we consider the following two cases:

*Case 1: When a pixel $p$ at position $(i, j)$ is fully occluded, we have $\gamma_{\zeta, w \times h}(x, a, b)_{i,j} = \zeta(x, i, j)$. We need to prove that $O_{\gamma,x}(a, w, b, h, \zeta)_{i,j} = \zeta(x, i, j)$.*

Suppose $p$ is covered by an arbitrary uniform occlusion of size $w_0 \times h_0$ at position $(a_0, b_0)$. Since $p$ is covered by the occlusion, $i \ge a_0 \wedge i \le a_0 + w_0 - 1$ and $j \ge b_0 \wedge j \le b_0 + h_0 - 1$ hold.

We show the output of $O_{\gamma,x}(a, w, b, h, \zeta)_{i,j}$ by inspecting the $(i \times n + j)$-th output of the occlusion network after propagation, starting from the outputs of the $i$-th and $(i+m)$-th neurons of the first layer. According to the network structure discussed in Sec. 4.2, the $i$-th neuron in the first layer is 0 only when $i \ge a_0$, and the same property holds for the $(i+m)$-th neuron when $i \le a_0 + w_0 - 1$. Therefore, the outputs of the $i$-th and $(i+m)$-th neurons of the first layer are 0, which makes the $i$-th neuron in the first part of the second layer output the value 1. Through a similar process, the value of $z^{(2)}_j$ in the second part of the second layer is also 1.

The $(i \times n + j)$-th neuron in the third layer depends on the $i$-th and $j$-th neurons of the second layer that we just discussed. Therefore, the output of that neuron, $z^{(3)}_{i \times n + j}$, is 1. For uniform occlusion, suppose the coloring function $\zeta$ has a fixed value $\mu_0$. By propagating the output $z^{(3)}_{i \times n + j}$ to the fourth layer, which is calculated as $W_4 \times z^{(3)} + b_4$, the $(i \times n + j)$-th output of the fourth layer is $1 \times (\mu_0 - p_{i,j}) + p_{i,j} = \mu_0$. Likewise, for multiform occlusion, $\zeta$ indicates the threshold $\epsilon_0$ by which a pixel can change. The $(i \times n + j)$-th extra neuron outputs $\epsilon_0$, and the corresponding neuron in the fourth layer then outputs $p_{i,j} + \epsilon_0$.

This output of $O_{\gamma,x}(a, w, b, h, \zeta)_{i,j}$ is identical to $\gamma_{\zeta, w \times h}(x, a, b)_{i,j}$, the expected pixel value at position $(i, j)$, which also indicates that the color is correctly encoded.

*Case 2: When a pixel $p$ at position $(i, j)$ is not occluded, we have $\gamma_{\zeta, w \times h}(x, a, b)_{i,j} = x_{i,j}$. We need to prove that $O_{\gamma,x}(a, w, b, h, \zeta)_{i,j} = x_{i,j}$.*

In this case, $i < a_0 \vee i \ge a_0 + w_0$ and $j < b_0 \vee j \ge b_0 + h_0$ hold for pixel $p$. Following a process similar to Case 1, the corresponding neuron in the third layer outputs 0, and the output of the $(i \times n + j)$-th neuron in the fourth layer is the original pixel value of $p$.

For occlusions at real-number positions, a few more cases need to be discussed, but the proof has a very similar sketch to the one for occlusions at integer positions. We leverage the equality $a \times b = \exp(\log(a) + \log(b))$ and add it to the propagation between the third layer and the extra neurons only when the occlusion is at a real-number position in the multiform case. In the implementation, we use $ReLU(a + b - 1)$ as an alternative to logarithms and exponentials, since Marabou does not support such operations. Due to the page limit, please refer to [15] for the details of the full proof.
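The substitution is exact in the relevant regime: when one operand is a 0/1 indicator and the other lies in $[0, 1]$, $ReLU(a + b - 1)$ coincides with the product $a \times b$. A minimal sketch illustrating this identity:

```python
def relu(v):
    return max(0.0, v)

def product_via_relu(a, b):
    # ReLU(a + b - 1) equals a * b whenever one operand is a 0/1
    # indicator and the other lies in [0, 1]:
    #   b = 1: ReLU(a + 0) = a;  b = 0: ReLU(a - 1) = 0 for a <= 1.
    return relu(a + b - 1.0)

for a in [0.0, 0.3, 0.5, 1.0]:
    for b in [0.0, 1.0]:
        assert abs(product_via_relu(a, b) - a * b) < 1e-9
```

This is why the ReLU form can stand in for the exp/log encoding of multiplication inside a Marabou query.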

Theorem 1 can be directly derived from Lemma 1 and Defnition 3 by substituting <sup>γ</sup>ζ,*w*×*h*(*x*, *<sup>a</sup>*, *<sup>b</sup>*) for *<sup>O</sup>*γ,*<sup>x</sup>*(*a*,*w*, *<sup>b</sup>*, *<sup>h</sup>*, ζ) in the defnition.

#### 4.4 Verification Acceleration Techniques

Existing SMT-based neural network verification tools can directly verify the composed neural network. The number of ReLU activation functions in the network is the primary factor determining the verification time of the backend tools. In the occlusion part, the number of ReLU nodes is independent of the scale of the original network to be verified. Therefore, the scalability of our approach relies only on the underlying tools.

To further improve verification efficiency, we integrate two algorithmic acceleration techniques that divide the verification problem into small, independent sub-problems that can be solved separately.

Occlusion Space Splitting. We observed that verifying the composed neural network over a large input space can significantly degrade the efficiency of backend verifiers. Even for small FNNs with only tens of ReLUs, the verifiers may time out due to the large occlusion space to search. For instance, the complexity of Reluplex [20] can be derived from the underlying Simplex method [32]. It has a complexity of $\Omega(v \times m \times n)$, where $m$ and $n$ represent the numbers of constraints and variables, and $v$ represents the number of pivots performed in the Simplex method. In the worst case, $v$ can grow exponentially. Reducing the search space reduces the number of pivot operations, thereby significantly improving verification efficiency.

Based on the above observation, we can divide $[1, m]$ (*resp.* $[1, n]$) into $k_m \in \mathbb{Z}^+$ (*resp.* $k_n \in \mathbb{Z}^+$) intervals $[m_0, m_1], \ldots, [m_{k_m-1}, m_{k_m}]$ (*resp.* $[n_0, n_1], \ldots, [n_{k_n-1}, n_{k_n}]$) and verify the problem on the Cartesian product of the two sets of intervals.

$$\begin{aligned} \forall \mathbf{x'} \in \mathbb{X}.\ \Phi(\mathbf{x'}) = \Phi(\mathbf{x}) \equiv \bigwedge\_{(i,j)=(0,0)}^{(k\_m-1,\,k\_n-1)} \forall \mathbf{x'} \in \mathbb{X}\_{(i,j)}.\ \Phi(\mathbf{x'}) = \Phi(\mathbf{x}), \text{ where} \\ \mathbb{X} = \bigcup\_{(i,j)=(0,0)}^{(k\_m-1,\,k\_n-1)} \mathbb{X}\_{(i,j)} = \bigcup\_{(i,j)=(0,0)}^{(k\_m-1,\,k\_n-1)} \{\gamma\_{\zeta,w\times h}(\mathbf{x},a,b) \mid m\_i \le a \le m\_{i+1},\ n\_j \le b \le n\_{j+1}\}. \end{aligned} \tag{7}$$

In this way, we split the occlusion space into $k_m \times k_n$ sub-spaces. It is equivalent to prove $\forall x' \in \mathbb{X}_{(i,j)}.\ \Phi(x') = \Phi(x)$ for all $\mathbb{X}_{(i,j)}$ with $0 \le i < k_m$ and $0 \le j < k_n$, without losing soundness or completeness. We call each verification instance a *query*; each query can be solved more efficiently by backend verifiers than the one on the whole occlusion space. Furthermore, the divided queries are independent and can be solved in parallel by leveraging multi-threaded computing.
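The splitting itself is straightforward; a minimal sketch (illustrative, not OccRob's code) that enumerates the sub-spaces of Eq. (7) as independent queries, with adjacent intervals sharing endpoints as in the equation:

```python
from itertools import product

def split(lo, hi, k):
    # Split the integer interval [lo, hi] into k contiguous sub-intervals;
    # adjacent sub-intervals share an endpoint, as in Eq. (7).
    bounds = [lo + (hi - lo) * t // k for t in range(k + 1)]
    return [(bounds[t], bounds[t + 1]) for t in range(k)]

def occlusion_queries(m, n, k_m, k_n):
    # One query per sub-space: the Cartesian product of the row and
    # column sub-intervals for the occlusion position (a, b).
    return list(product(split(1, m, k_m), split(1, n, k_n)))

# A 28x28 occlusion-position space split into 4x4 = 16 independent queries,
# which could then be dispatched to backend verifiers in parallel.
queries = occlusion_queries(28, 28, 4, 4)
assert len(queries) == 16
```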

Eager Falsification by Label Sorting. Another *Divide & Conquer* acceleration approach is to divide the verification problem into independent sub-problems by the classification labels in $L$, as defined below:

$$\forall \mathbf{x'} \in \mathbb{X}.\ \Phi(\mathbf{x'}) = \Phi(\mathbf{x}) \equiv \forall \mathbf{x'} \in \mathbb{X}. \bigwedge\_{\ell' \in L} \Phi(\mathbf{x}) = \ell' \lor \Phi(\mathbf{x'}) \neq \ell'. \tag{8}$$

The dual problem of disproving the robustness can be solved by finding some label $\ell'$ such that $\Phi(x) \neq \ell' \wedge \Phi(x') = \ell'$. We can first solve the sub-problems that have higher probabilities of being non-robust. Once a sub-problem is proved non-robust, the verification terminates, with no need to solve the remainder. This approach is called *eager falsification* [14]. Based on this methodology, we sort the sub-problems in descending order of the probabilities at which the original image is classified to the corresponding labels by the neural network. A higher probability implies that the image is more likely to be classified to the corresponding label; heuristically, there is then a higher probability of finding an occlusion such that the occluded image is misclassified to that label. We dispatch the queries to backend verifiers until all are verified or a non-robust case is reported. Our experimental results will show that the integrated acceleration techniques can achieve up to 8 and 24 times speedup in the robust and non-robust cases, respectively.

Table 1: Occlusion verification results on two medium FNNs trained on MNIST and GTSRB.

\* - / +: the numbers of non-robust and robust cases; $T_+$ (*resp.* $T_-$): average verification time in robust (*resp.* non-robust) cases; $T_{\text{build}}$: the building time of occlusion neural networks; TO (%): the percentage of timed-out cases among all the queries.
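The eager-falsification dispatch loop can be sketched as follows (illustrative Python; `solve` is a hypothetical stand-in for a backend verifier call on one label's query):

```python
def eager_falsification(probs, true_label, solve):
    # Dispatch per-label robustness queries in descending order of the
    # network's output probabilities; stop at the first counterexample.
    # `solve(label)` returns a counterexample or None (verifier stand-in).
    order = sorted((l for l in range(len(probs)) if l != true_label),
                   key=lambda l: probs[l], reverse=True)
    for label in order:
        cex = solve(label)
        if cex is not None:
            return ("non-robust", label, cex)  # falsified: stop early
    return ("robust", None, None)

# Toy run: label 2 is the most probable wrong label and admits a
# counterexample, so it is tried first and the remaining queries are skipped.
probs = [0.70, 0.05, 0.20, 0.05]
result = eager_falsification(probs, 0, lambda l: "occ@(1,2)" if l == 2 else None)
assert result == ("non-robust", 2, "occ@(1,2)")
```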

## 5 Implementation and Evaluation

We implemented our approach in a Python tool called OccRob, using the PyTorch framework. As the backend tool, we chose Marabou [21], a state-of-the-art SMT-based DNN verifier. We evaluated our proposed approach extensively on a suite of benchmark datasets, including MNIST [24] and GTSRB [16]. The size of the networks trained on these datasets, measured by the number of ReLUs, ranges from 70 to 1300. All the experiments were conducted on a workstation equipped with a 32-core AMD Ryzen Threadripper CPU @ 3.7 GHz, 128 GB RAM, and Ubuntu 18.04. We set a timeout threshold of 60 seconds for a single verification task. All code and experimental data, including the models and verification scripts, can be accessed at https://github.com/MakiseGuo/OccRob.

We evaluate our proposed method with respect to efficiency and scalability in the occlusion robustness verification of ReLU-based FNNs. Our goals are threefold, corresponding to the three experiments below.


Experiment I: Effectiveness. We first evaluate the effectiveness of OccRob in robustness verification against various types of occlusions of different sizes and color ranges. Table 1 shows the verification results and time costs against multiform occlusions on two medium FNNs trained on MNIST and GTSRB. We consider two occlusion sizes, 2 × 2 and 5 × 5, respectively. The occluding color range is from 0.05 to 0.40. In each verification task, we selected the first 30 images from each of the two datasets and verified the network's robustness around them under the corresponding occlusion settings. As expected, larger occlusion sizes and occluding color ranges imply more non-robust cases. One can see that OccRob can almost always verify or falsify each input image, except for a few timeouts. The robust cases cost more time than the non-robust ones, but all can be finished in a few minutes. Note that the time overhead for building occlusion neural networks is almost negligible compared with the verification time. The effectiveness against uniform occlusions is shown in the following experiment.

Fig. 7 shows several occlusive adversarial examples generated by OccRob under different occlusion settings. These occlusions do not alter the semantics of the original images, which should therefore be classified to the same results as the non-occluded ones. However, they are misclassified to other results.

Experiment II: Efficiency improvement over the naive encoding method. We compare the efficiency of OccRob with that of a naive SMT encoding approach on verifying uniform occlusions, since the naive encoding approach cannot handle verification against multiform occlusions. We apply the same acceleration techniques, such as parallelization and a variant of input space splitting, to the naive approach, which otherwise times out on almost all verification tasks even for the smallest model.

Table 2 shows the average verification time on six FNNs of different sizes against uniform occlusions. We can observe that OccRob affords a significant improvement in efficiency, up to 30 times over the naive approach. It always finishes before the preset time threshold, while the naive method fails to verify the two large networks under the same threshold. The timeout proportion on the two medium networks is over 70%; even on the small MNIST network, the naive method times out on 8% of the queries, whereas OccRob rarely times out on any network.

Table 2: Performance comparison between OccRob (OR) and the naive (NAI) method on MNIST and GTSRB under different occlusion sizes.


Experiment III: Effectiveness of the integrated acceleration techniques. We finally evaluate the effectiveness of the two acceleration techniques integrated into the tool. We evaluate each technique separately by excluding it from OccRob and comparing the verification time of OccRob with that of the corresponding excluded version. Fig. 8 shows the experimental results of verifying the medium FNN trained on GTSRB against multiform occlusions. Fig. 8 (a) shows that label sorting improves efficiency in both robust and non-robust cases. The improvement is more significant in the non-robust case, with up to 5 times speedup in the experiment. That is because solving each query is faster than solving all of them simultaneously, and OccRob immediately stops dispatching queries once a counterexample is found in the non-robust case. Fig. 8 (b) shows that occlusion space splitting can also significantly improve efficiency, with up to 8 and 24 times speedup in the robust and non-robust cases, respectively. In addition, Fig. 8 (b) shows a significant reduction in the number of timeouts.

## 6 Related Work

Robustness verification of neural networks has been extensively studied recently, aiming to devise efficient methods for verifying neural networks' robustness against various types of perturbations and adversarial attacks. We classify these methods into two categories according to the type of perturbation, which can be semantic or non-semantic. A semantic perturbation has an interpretable meaning, such as occlusions and geometric transformations like rotation, while a non-semantic perturbation perturbs inputs with noise that carries no particular meaning.

Fig. 8: Efficiency evaluation results of the two acceleration techniques.

Non-semantic perturbations are usually represented by $L_p$ norms, which define the ranges in which an input can be altered. Some robustness verification approaches for non-semantic perturbations are both sound and complete by leveraging SMT [20,1] and MILP (mixed integer linear programming) [36] techniques, while others sacrifice completeness for better scalability via over-approximation [29,2,7], abstract interpretation [34,10,5], interval analysis by symbolic propagation [43,42,26], etc.

In contrast to the large number of works on non-semantic robustness verification, there are only a few studies on the semantic case. Because semantic perturbations are beyond the range of $L_p$ norms [9], the abstraction-based approaches cannot be directly applied to verifying semantic perturbations. Mohapatra et al. [30] proposed to verify neural networks against semantic perturbations by encoding them into neural networks. Their encoding approach generalizes to a family of semantic perturbations such as brightness and contrast changes and rotations, but their approach for verifying occlusions is restricted to uniform occlusions at integer locations. Sallami et al. [31] proposed an interval-based method to verify robustness against occlusion perturbations under the same restriction. Singh et al. [35] proposed a new abstract domain to encode both non-semantic and semantic perturbations such as rotations. Chiang et al. [4] called occlusions *adversarial patches* and proposed a certifiable defense by extending interval bound propagation (IBP) [12]. Compared with these existing verification approaches for semantic perturbations, our SMT-based approach is both sound and complete, and meanwhile supports a larger class of occlusion perturbations.

## 7 Conclusion and Future Work

We introduced an SMT-based approach for verifying the robustness of deep neural networks against various types of occlusions. An efficient encoding method was proposed to represent occlusions as neural networks, by which we reduced the occlusion robustness verification problem to a regular robustness verification problem of neural networks and leveraged *off-the-shelf* SMT-based verifiers for the verification. We implemented the resulting prototype OccRob and intensively evaluated its effectiveness and efficiency on a series of neural networks trained on public benchmarks, including MNIST and GTSRB. Moreover, as the scalability of DNN verification engines continues to improve, our approach, which uses them as black-box backends, will also become more scalable.

As our occlusion encoding approach is independent of the target neural networks, we believe it can be easily extended to other complex network structures, such as convolutional and recurrent ones, depending only on the backend verifiers. It would also be interesting to investigate how the generated adversarial examples could be used for neural network repair [41,18] to train more robust networks.

## Acknowledgments

This work has been supported by National Key Research Program (2020AAA0107800), NSFC-ISF Joint Program (62161146001, 3420/21) and NSFC projects (61872146, 61872144), Shanghai Science and Technology Commission (20DZ1100300), Shanghai Trusted Industry Internet Software Collaborative Innovation Center and "Digital Silk Road" Shanghai International Joint Lab of Trustworthy Intelligent Software (Grant No. 22510750100).

## References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## Neural Network-Guided Synthesis of Recursive List Functions

Naoki Kobayashi() and Minchao Wu

The University of Tokyo, Tokyo, Japan koba@is.s.u-tokyo.ac.jp

Abstract. Kobayashi et al. have recently proposed NeuGuS, a framework for neural-network-guided synthesis of logical formulas or simple program fragments, where a neural network is first trained on sample data, and a logical formula over integers is then constructed by using the weights and biases of the trained network as hints. The previous method was, however, restricted to the class of formulas of quantifier-free linear integer arithmetic. In this paper, we propose a NeuGuS method for the synthesis of recursive predicates over lists definable using the left fold function. To this end, we design and train a special-purpose recurrent neural network (RNN), and use the weights of the trained RNN to synthesize a recursive predicate. We have implemented the proposed method and conducted preliminary experiments to confirm its effectiveness.

## 1 Introduction

Kobayashi et al. [12] have recently proposed a framework called Neural-Network-Guided Synthesis (NeuGuS) for the synthesis of quantifier-free logical expressions over integer variables, which may also be viewed as simple program expressions over integer variables. Given sample data (also called training data below), which consist of positive/negative samples and implication constraints [6] such as "if $d_1$ is a positive sample, so is $d_2$" (where it is unknown whether $d_1$ is indeed a positive sample), NeuGuS first trains a feed-forward neural network on the sample data, and then constructs a logical expression over integers (more precisely, a Boolean combination of inequalities over integer variables) by using the weights and biases of the neural network as hints. The main characteristic of NeuGuS is its gray-box use of neural networks: NeuGuS first trains a neural network, but instead of directly using the trained network as a classifier, it tries to construct a simple logical expression by using the trained network as a hint. Advantages of the gray-box approach over the white-box approach of using the network itself as a classifier include: (i) if successful, a simple classifier is obtained that is easier to understand (for human beings) and verify (for computers), and (ii) we need not worry too much about overfitting; even if the trained network overfits the given sample data, we may still be able to extract useful information, such as features important for the classification, and use it to construct a simple classifier. Kobayashi et al. [12,13] have applied the framework to automated program verification, where NeuGuS is used to find program invariants, and to program synthesis, where, given a program sketch containing holes called oracles, NeuGuS is used to find program expressions to fill the holes.

In this paper, we extend NeuGuS to enable the synthesis of recursive predicates over Booleans, integers, lists of Booleans, and lists of integers from positive/negative samples and implication constraints. For example, in the case of the synthesis of a sortedness predicate, the extended NeuGuS (henceforth simply called NeuGuSR) takes as input sample data like:

```
sorted([1; 3; 4]) sorted([2; 5; 6; 7]) ¬sorted([3; 1; 4]) ¬sorted([5; 2; 7; 6])
sorted([1; 3; 5]) ⇒ sorted([1; 3; 5; 6]) · · · .
```
Here, sorted([1; 3; 5]) ⇒ sorted([1; 3; 5; 6]) means that if sorted([1; 3; 5]) is true, so is sorted([1; 3; 5; 6]). The goal of the synthesis is to construct a recursive program that satisfies the constraints specified by the sample data. In the case of the above example, we aim to construct a program (written in the OCaml language: https://ocaml.org/) like:

```
let sorted l =
  let rec sorted_aux l b r =
    match l with [] -> b
               | x::l' -> sorted_aux l' (b && r <= x) x
  in sorted_aux l true 0
```
Here, the Boolean argument b of the auxiliary function sorted\_aux denotes whether the elements of the list read so far are sorted (in ascending order), and the integer argument r keeps the last element read (initially set to 0; hence, the function sorted judges the sortedness of lists consisting of non-negative integers) to compare it with the next element. The recursive programs constructed with our method are restricted to those definable using the left fold function. Note that the function sorted above can be expressed as foldl (λ(b, r).λx.(b ∧ r ≤ x, x)) (true, 0) using the left fold function foldl.<sup>1</sup>
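For comparison, the same fold-based definition can be mirrored in Python with `functools.reduce` (an illustrative sketch; the programs synthesized by the method itself are OCaml):

```python
from functools import reduce

def sorted_pred(lst):
    # Fold-based sortedness check mirroring the OCaml program: the
    # accumulator (b, r) tracks whether the prefix read so far is sorted
    # and the last element read (initially 0, so only lists of
    # non-negative integers are judged correctly).
    b, _ = reduce(lambda acc, x: (acc[0] and acc[1] <= x, x), lst, (True, 0))
    return b

assert sorted_pred([1, 3, 4]) is True
assert sorted_pred([3, 1, 4]) is False
```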

<sup>1</sup> In fact, the program above is written so that it matches the computation of the left fold function. Otherwise, sorted\_aux could alternatively be defined to return false immediately when r > x holds.

To synthesize recursive predicates, we first train a recurrent neural network (RNN), and construct a recursive program like the one above by using, as hints, the weights of the RNN and information about the executions of the RNN on the training data. We have designed a special-purpose RNN for this purpose, with the synthesis of recursive programs in mind. Figure 1 shows the overall structure of our RNN. The RNN has two kinds of inputs, Boolean lists and integer lists (whose elements are read one by one), and a Boolean output. The inputs and output correspond to those of the function to be synthesized, which takes m Boolean lists and n integer lists as arguments and returns a Boolean value. Here, we assume that the lists are of equal length, replicating integer arguments and padding short lists with dummy elements if necessary. For example, if the argument of the function to be synthesized is ([1; 2; 3], 0, [1; 0]), then the input to the RNN will be ([1; 2; 3], [0; 0; 0], [1; 0; −1]). The Boolean values true and false are represented as 1 and −1, respectively. The RNN also has two kinds of hidden states: Booleans and integers. The Boolean hidden states are actually represented as numerical values, but they are constrained to range over [−1, 1] by using the hyperbolic tangent function tanh as the activation function for those values inside the feed-forward network. The details of the feed-forward network will be discussed later.

Fig. 1. The overall structure of the special-purpose RNN

After training the RNN, by using (i) the weights and biases of each link/node and (ii) the input/output behavior of the trained feed-forward network as hints, we construct a function:

$$step: \mathbf{B}^m \times \mathbf{Z}^n \times \mathbf{B}^h \times \mathbf{Z}^k \to \mathbf{B}^h \times \mathbf{Z}^k,$$

which takes the current input (consisting of m Booleans and n integers) and the current values of Boolean and integer hidden states, and returns the next hidden states. Here, Z and B are the types of integers and Booleans respectively. We then construct the whole program as the one that "folds" the input lists by using the step function, where the base-case values correspond to the initial values of the hidden states; more details are discussed in later sections. Finally, we check whether the synthesized program conforms to the sample data and if so, output the program; otherwise we retrain the RNN and retry the program synthesis.

We have implemented a program synthesis tool based on the above idea. We have confirmed through experiments that the tool worked reasonably well; our tool could successfully synthesize the sortedness predicate above, as well as other non-trivial predicates, including the binary predicate avge(ℓ, n), which means that the average value of the elements in the list ℓ is no less than n.

The rest of this paper is structured as follows. Section 2 defines the program synthesis problem considered in this paper. Section 3 introduces our special-purpose RNN. Section 4 explains how to synthesize a program from a trained RNN. Section 5 reports an implementation and experimental results. Section 6 discusses related work and Section 7 concludes the paper.

## 2 The Synthesis Problem

This section defines the problem of program synthesis considered in this paper. We write B and Z for the sets of Booleans and integers, respectively. For a set S, we write S* for the set of sequences consisting of elements of S, and S₁ × ⋯ × Sₖ for the set of tuples of the form (v₁, …, vₖ) with vᵢ ∈ Sᵢ for each i. We sometimes call an element of S* a list, based on the terminology used in programming languages, and write [a₁; ⋯; aₙ] instead of a₁ ⋯ aₙ.

We assume a finite set of variables called predicate variables. A signature maps each predicate variable to its domain of the form T₁ × ⋯ × Tₖ, where Tᵢ ∈ {B, Z, B*, Z*}. For example, for a signature K and a predicate variable p, K(p) = Z* × Z means that p is a binary predicate that takes an integer list and an integer as arguments.

For a signature K, we write Atoms_K for the set of pairs (p, v) where v ∈ K(p); we often write p(v) for (p, v). An implication constraint is a formula of the form a₁ ∧ ⋯ ∧ aₖ ⇒ b₁ ∨ ⋯ ∨ b_ℓ, where a₁, …, aₖ, b₁, …, b_ℓ ∈ Atoms_K. Let Θ be an interpretation for predicate variables, i.e., a map that assigns a predicate Θ(p) ⊆ K(p) to each predicate variable p ∈ dom(K). We write Θ |= p(v) if v ∈ Θ(p). We write Θ |= a₁ ∧ ⋯ ∧ aₖ ⇒ b₁ ∨ ⋯ ∨ b_ℓ, and say that Θ satisfies the implication constraint, when Θ |= bⱼ for some j ∈ {1, …, ℓ} whenever Θ |= aᵢ for every i ∈ {1, …, k}.
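This satisfaction condition can be transcribed directly (a minimal OCaml sketch; here the two lists hold the already-evaluated truth values Θ |= aᵢ and Θ |= bⱼ):

```ocaml
(* An implication constraint a1 ∧ ... ∧ ak ⇒ b1 ∨ ... ∨ bl is satisfied
   iff some antecedent ai is false or some consequent bj is true. *)
let satisfies (antecedents : bool list) (consequents : bool list) : bool =
  List.exists not antecedents || List.exists (fun b -> b) consequents
```

Note that positive constraints (⇒ b) have an empty antecedent list, and negative constraints (a ⇒) an empty consequent list.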

The synthesis problem considered in this paper is the problem of, given a signature K and a set of implication constraints as input, finding (a description of) a predicate assignment Θ that satisfies all the implication constraints. As a description of the predicate assigned to each predicate variable, we consider the class of functions f defined by programs of the following form:

$$\begin{aligned} \text{let } & f(\widetilde{x}:T\_1 \times \dots \times T\_n) = \\ & \quad \text{let rec } g(y,\widetilde{r}) = \text{match } y \text{ with} \\ & \qquad \mid [\,] \to r\_1 \\ & \qquad \mid (u\_1, \dots, u\_n) :: y' \to \text{let } \widetilde{r}' = step(u\_1, \dots, u\_n, \widetilde{r}) \text{ in } g(y', \widetilde{r}') \\ & \quad \text{in } g(ezip\_{T\_1 \times \dots \times T\_n}(\widetilde{x}), \widetilde{d}). \end{aligned}$$

Here, x̃ denotes a sequence x₁, …, xₖ, and d̃ denotes a sequence of default integer or Boolean values d₁, …, d_ℓ, where each dᵢ is true or 0; we write d_B for true and d_Z for 0.<sup>2</sup> The function ezip is an extended "zip" function, which maps a tuple

<sup>2</sup> The use of fixed default values slightly restricts the class of functions. In fact, the value of f([ ]) is restricted to true. To remove the restriction, it suffices to either (i) allow d̃ to take other values and make them also learnable, or (ii) replace r₁ with h(r₁) and make the Boolean function h also learnable.

consisting of lists, integers, and Booleans to a list of tuples. It is defined by:

$$\begin{aligned} & ezip\_{T\_1\times\cdots\times T\_n}(v\_1,\ldots,v\_n) = \begin{cases} [\,] & \text{if every } v\_i \text{ is } [\,], \text{ an integer, or a Boolean} \\ (ehd\_{T\_1}(v\_1),\ldots,ehd\_{T\_n}(v\_n)) :: ezip\_{T\_1\times\cdots\times T\_n}(etl\_{T\_1}(v\_1),\ldots,etl\_{T\_n}(v\_n)) & \text{otherwise} \end{cases} \\ & ehd\_{\mathbf{Z}^{\*}}([\,]) = -1 \quad ehd\_{\mathbf{Z}^{\*}}(n::v) = n \quad ehd\_{\mathbf{B}^{\*}}([\,]) = \mathbf{false} \quad ehd\_{\mathbf{B}^{\*}}(b::v) = b \\ & ehd\_{\mathbf{Z}}(n) = n \quad ehd\_{\mathbf{B}}(b) = b \quad etl\_{\mathbf{Z}}(n) = n \quad etl\_{\mathbf{B}}(b) = b \\ & etl\_{\mathbf{Z}^{\*}}([\,]) = [\,] \quad etl\_{\mathbf{Z}^{\*}}(n::v) = v \quad etl\_{\mathbf{B}^{\*}}([\,]) = [\,] \quad etl\_{\mathbf{B}^{\*}}(b::v) = v. \end{aligned}$$

For example, ezip_{Z*×Z*×Z}([1; 2; 3], [2; 3], 1) = [(1, 2, 1); (2, 3, 1); (3, −1, 1)]. The function step is the main target of the synthesis. It should be a function on integers and Booleans, consisting of (i) Boolean operations, (ii) affine expressions of the form c₀ + c₁x₁ + ⋯ + cₖxₖ, and (iii) inequalities of the form e ≤ 0, where e is an affine expression. The function g above can also be expressed as

$$\lambda \widetilde{x}.\,\#\_1(foldl\ step'\ \widetilde{d}\ (ezip\_{T\_1 \times \cdots \times T\_n}(\widetilde{x}))),$$

where foldl is the left fold function, step′ is the curried version of step, and #₁ denotes the projection of a tuple to its first element.
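For illustration, the instance ezip_{Z*×Z*×Z} used in the example above can be sketched in OCaml as follows (a specialized version; the general ezip is defined over all argument shapes, and the dummy element −1 and the replication of scalars follow the definition above):

```ocaml
(* ezip for two integer lists and one integer: pads the shorter list with
   the dummy element -1 (the ehd of an empty integer list), replicates the
   scalar, and produces a list of triples. *)
let rec ezip_zz_z (v1 : int list) (v2 : int list) (n : int) : (int * int * int) list =
  match v1, v2 with
  | [], [] -> []
  | _ ->
    let ehd = function [] -> -1 | x :: _ -> x in
    let etl = function [] -> [] | _ :: v -> v in
    (ehd v1, ehd v2, n) :: ezip_zz_z (etl v1) (etl v2) n
```

For instance, `ezip_zz_z [1; 2; 3] [2; 3] 1` yields `[(1, 2, 1); (2, 3, 1); (3, -1, 1)]`, matching the example in the text.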

In the case of the sortedness predicate discussed in Section 1, T₁ = Z* with k = 1, the length of the tuple of auxiliary parameters of g is 2, and step : Z × B × Z → B × Z is given by step(u, r_b, r_i) = (r_b ∧ (r_i ≤ u), u).

For the predicate avge mentioned in Section 1, T₁ = Z* and T₂ = Z with k = 2, and step : Z × Z × B × Z → B × Z is given by step(u₁, u₂, r_b, r_i) = (r_i + u₁ − u₂ ≥ 0, r_i + u₁ − u₂). Here, during the computation of avge(ℓ, m), the integer parameter r_i accumulates the sum of ℓᵢ − m (where ℓᵢ is the i-th element of ℓ). Whether the average of the elements of ℓ is no less than m can then be determined by checking whether the final value of r_i is no less than 0.
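Putting these pieces together, the folded program for avge can be sketched in OCaml as follows (a hedged sketch: the scalar argument m is replicated across the list inline, as ezip would do, and the function names are illustrative):

```ocaml
(* step for avge: the integer state accumulates the running sum of (li - m);
   the Boolean state records whether that sum is currently non-negative. *)
let step (u1, u2, (_rb : bool), ri) = (ri + u1 - u2 >= 0, ri + u1 - u2)

(* avge l m: the average of the elements of l is no less than m.
   The default accumulator (true, 0) matches the default values d_B and d_Z. *)
let avge (l : int list) (m : int) : bool =
  fst (List.fold_left (fun (rb, ri) u -> step (u, m, rb, ri)) (true, 0) l)
```

For example, `avge [1; 2; 3] 2` is `true` (the average is exactly 2), while `avge [1; 2; 3] 3` is `false`.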

Our synthesis problem subsumes the problem of learning automata (which is obtained as a special case where the signature consists of a single predicate p : (B*)ᵐ and step : Bᵐ⁺ⁿ → Bⁿ; input symbols and states are encoded as elements of Bᵐ and Bⁿ, respectively) and also that of symbolic automatic relations [19]. In fact, the automatic synthesis of symbolic automatic relations was one of the motivations behind the present paper, as explained below.

The motivations for the synthesis problem above come from automated program verification and synthesis. For automated program verification, we have CHC-based program verification [1] in mind, where various program verification problems are reduced to the satisfiability problem for Constrained Horn Clauses (CHCs). For programs using lists, the CHCs obtained by the reduction involve predicates over lists, but the current CHC solvers [10,14,2] are not very good at solving such CHCs. A solver for the synthesis problem above can be used as an important component in a CHC solver [2,4] based on the ICE-learning framework [6], to synthesize a candidate solution for CHCs involving lists. Another application is the oracle-based programming mentioned in Section 1, whose goal is to synthesize code fragments to fill the holes of a given program pattern. By solving the synthesis problem above, we can automatically synthesize code

Fig. 2. The feed-forward network inside the RNN

fragments that involve recursive computation over lists. The roles of implication constraints in those applications are explained in [2,13].

In both of the applications above, the validity of a synthesized program is determined based on the whole verification or synthesis goal (in the case of verification, a synthesized predicate over lists is valid if it is indeed a solution for the CHC satisfiability problem). Thus, in the actual applications, the synthesis problem defined above needs to be repeatedly solved with the set of sample data being gradually expanded, until the end goal of program verification or synthesis is achieved.

## 3 The Design and Training of the RNN

This section describes the design of our special-purpose recurrent neural network (RNN) tailored to our synthesis problem, and how to train it.

#### 3.1 The Architecture of the RNN

The overall structure of the RNN is as already depicted in Figure 1. The structure of the feed-forward (FF) network inside the RNN is shown in Figure 2. The network consists of four layers of nodes, where the first layer (the leftmost one) consists of the input nodes of the FF network, which hold the input values and hidden state values of the whole RNN, and the fourth layer (the rightmost one) consists of the output nodes of the FF network, which hold the next states of the RNN. The diamond-shaped nodes take values in the range [−1, 1] (either by the assumption on inputs or by the use of tanh as the activation function), and the circle-shaped nodes take arbitrary floating-point numbers. The value of each diamond-shaped node is computed by tanh(b + w₁x₁ + ⋯ + wₖxₖ) and that of each circle node by b + w₁x₁ + ⋯ + wₖxₖ, where the bias b and the weights wᵢ vary for each node and link. Each ⊗ node in the fourth layer has exactly two inputs x and y, and outputs ((x + 1)/2)·y, where x is the output of the diamond-shaped node.

The part of the FF network that computes the diamond-shaped nodes in the fourth layer is analogous to the network in the previous NeuGuS framework [12] for the synthesis of logical formulas. Each diamond-shaped node in the second layer, whose output is tanh(b + w₁x₁ + ⋯ + wₖxₖ), is intended to recognize linear inequalities of the form c₀ + c₁x₁ + ⋯ + cₖxₖ ≥ d, where |d| is a small integer and cᵢ/c₀ = wᵢ/b. The idea is that the value of the node, tanh(b + w₁x₁ + ⋯ + wₖxₖ) = tanh((b/c₀) · (c₀ + c₁x₁ + ⋯ + cₖxₖ)), is close to −1 or 1 when both |b/c₀| and |c₀ + c₁x₁ + ⋯ + cₖxₖ| are large, so that the node carries only information about whether c₀ + c₁x₁ + ⋯ + cₖxₖ ≥ d holds for each d such that |d| is small. The diamond-shaped nodes in the third and fourth layers are intended to compute Boolean combinations of those linear inequalities and the Boolean inputs/hidden states.

The rest of the FF network, for computing the ⊗-nodes in the fourth layer, is intended to compute conditional expressions of the form

$$\text{if } b \text{ then } c\_0 + c\_1 x\_1 + \dots + c\_k x\_k \text{ else } 0,$$

where b is a logical combination of linear inequalities and Boolean inputs/hidden states. Each circle node in the second layer computes the part c₀ + c₁x₁ + ⋯ + cₖxₖ, each node in the lower group of the third layer computes the Boolean value b, and each ⊗-node emulates the conditional expression. The idea is that the Boolean value b is actually represented as a value in [−1, 1], where values close to −1 and 1 are respectively interpreted as false and true. Thus, ((b + 1)/2)(c₀ + c₁x₁ + ⋯ + cₖxₖ) is close to c₀ + c₁x₁ + ⋯ + cₖxₖ when b represents true, and close to 0 when b represents false. Note that the general conditional if b then e₁ else e₂ can be expressed as (if b then e₁ else 0) + (if ¬b then e₂ else 0) = ((b + 1)/2)e₁ + ((−b + 1)/2)e₂, which can be computed in the next cycle if we have hidden states that correspond to if b then e₁ else 0 and if ¬b then e₂ else 0.
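The gating behavior can be sketched numerically (assuming, as in the text, that b takes the ideal values ±1):

```ocaml
(* x ⊗ y = ((x + 1) / 2) * y: passes y through when x = 1 ("true")
   and outputs 0 when x = -1 ("false"). *)
let gate (b : float) (e : float) : float = (b +. 1.) /. 2. *. e

(* A general conditional "if b then e1 else e2" as a sum of two gates. *)
let cond (b : float) (e1 : float) (e2 : float) : float =
  gate b e1 +. gate (-.b) e2
```

For example, `gate 1.0 7.0` is `7.0`, `gate (-1.0) 7.0` is `0.0`, and `cond (-1.0) 7.0 3.0` is `3.0`.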

Remark 1. As explained above, the internal structure of our RNN is specialized for the purpose of solving our synthesis problem, and quite different from other popular RNNs. The ⊗-node is reminiscent of a multiplicative gate of LSTM [9], but its main role is to emulate a conditional expression, rather than to address problems of conventional RNNs such as the vanishing gradient problem. In fact, we do not expect our RNN to scale to very long lists. Fortunately, however, training data with short lists would often suffice for our synthesis problem.

#### 3.2 Training the RNN

Let R be the set of real numbers and g ∈ [−1, 1]ʰ⁺ᵐ × Rⁿ⁺ᵏ → [−1, 1]ʰ × Rᵏ be the function computed by the FF network. The function f ∈ ([−1, 1]ᵐ × Rⁿ)* → [−1, 1] computed by the whole RNN is defined by f(ℓ) = f′(1̃, ℓ, 0̃), where:

$$f'(b\_1, \ldots, b\_h, [\,], \widetilde{z}) = b\_1 \qquad f'(\widetilde{b}, x :: \ell', \widetilde{z}) = f'(\widetilde{b}', \ell', \widetilde{z}') \text{ where } (\widetilde{b}', \widetilde{z}') = g(\widetilde{b}, x, \widetilde{z}).$$

$$\text{Here, } f' \in [-1, 1]^h \times ([-1, 1]^m \times \mathbf{R}^n)^\* \times \mathbf{R}^k \to [-1, 1].$$

For an atom p(ṽ, w̃) with ṽ ∈ Bᵐ and w̃ ∈ Zⁿ, we write O_{p(ṽ,w̃)} for f(ṽ†, w̃), where true† = 1 and false† = −1. For an implication constraint a₁ ∧ ⋯ ∧ aₖ ⇒ b₁ ∨ ⋯ ∨ b_ℓ, we define the loss loss_{a₁∧⋯∧aₖ⇒b₁∨⋯∨b_ℓ} for the implication constraint by:<sup>3</sup>

$$loss\_{a\_1 \wedge \dots \wedge a\_k \Rightarrow b\_1 \vee \dots \vee b\_\ell} := \prod\_{i \in \{1, \ldots, k\}} \left(\frac{1 + O\_{a\_i}}{2}\right)^2 \cdot \prod\_{j \in \{1, \ldots, \ell\}} \left(\frac{1 - O\_{b\_j}}{2}\right)^2.$$

Note that loss_{a₁∧⋯∧aₖ⇒b₁∨⋯∨b_ℓ} is 0 just if one of the O_{aᵢ}'s is −1 (i.e., aᵢ is false) or one of the O_{bⱼ}'s is 1 (i.e., bⱼ is true), which matches the meaning of the implication constraint. For a set C = {γ₁, …, γₚ} of implication constraints, the overall loss is defined by loss_C := Σ_{i∈{1,…,p}} loss_{γᵢ}. Using the loss function above, we train the RNN with a gradient descent method.
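A direct OCaml transcription of this loss (operating on the already-computed output values O_{aᵢ} and O_{bⱼ}, each in [−1, 1]) might read:

```ocaml
(* Loss of one implication constraint a1 ∧ ... ∧ ak ⇒ b1 ∨ ... ∨ bl:
   zero exactly when some O_ai = -1 (antecedent false) or some O_bj = 1
   (consequent true). *)
let loss_constraint (oas : float list) (obs : float list) : float =
  let sq x = x *. x in
  let prod f = List.fold_left (fun acc o -> acc *. sq (f o)) 1.0 in
  prod (fun o -> (1. +. o) /. 2.) oas *. prod (fun o -> (1. -. o) /. 2.) obs

(* Overall loss: the sum over a set of constraints, given as
   (antecedent outputs, consequent outputs) pairs. *)
let loss_all (cs : (float list * float list) list) : float =
  List.fold_left (fun acc (oas, obs) -> acc +. loss_constraint oas obs) 0.0 cs
```

For instance, a satisfied positive constraint `loss_constraint [] [1.0]` contributes `0.0`, while a maximally violated one, `loss_constraint [1.0] [-1.0]`, contributes `1.0`.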

Adjusting the loss function. The diamond-shaped nodes in Figure 2 are intended to hold Boolean values (which correspond to 1 and −1), but those nodes in the actual RNN trained with the above loss function may take values close to 0, which cannot be interpreted as true or false. That is problematic for the program synthesis, because the behavior of the RNN may deviate too much from that of an ordinary program to be synthesized. To remedy the problem, we also use a modified version of the loss function, obtained by replacing Oₐ in the basic loss function above with O′ₐ := Oₐ · Πᵢ (1 + λvᵢ²)/(1 + λ), where λ ≥ 0 (note that the modified loss function coincides with the basic one when λ = 0) and vᵢ is the value of a diamond-shaped node in the second or fourth layer of the FF network in Figure 2. This penalizes the use of "non-Boolean values" in diamond-shaped nodes. Note that if vᵢ cannot be interpreted as true or false, i.e., if |vᵢ| is close to 0, then (1 + λvᵢ²)/(1 + λ) is much smaller than 1; thus, |O′ₐ| would also be much smaller than 1, causing a large loss.

## 4 Synthesis Based on the Trained RNN

This section discusses how to construct the function step in Section 2, by using the trained RNN as a hint. From the trained RNN and its runs for training data, we gather and use the following information.


The output of the function step consists of Booleans and integers. We first discuss how to construct the integer part. The integer part of step corresponds

<sup>3</sup> This loss function is different from the one used in [12]. The difference is partly due to the encoding of Boolean values; Kobayashi et al. [12] used 0 for false while we use −1. Another difference is the use of log vs squared loss. We preferred the latter for simplicity, but more experiments are necessary to tune the shape of the loss function.

to the ⊗-nodes of the FF-network in Figure 2, whose values are computed by a function of the form:

$$I(\widetilde{r}, \widetilde{v}, \widetilde{u}, \widetilde{s}) := B(\widetilde{r}, \widetilde{v}, \widetilde{u}, \widetilde{s}) \otimes \Big(b\_0 + \sum\_{i \in \{1,\ldots,n\}} w\_i u\_i + \sum\_{j \in \{1,\ldots,k\}} w'\_j s\_j\Big),$$

where r̃, ṽ, ũ, and s̃ respectively represent the hidden Boolean states, Boolean inputs, integer inputs, and hidden integer states; the function B is the output of a node in the lower half of the third layer in Figure 2; the part b₀ + ⋯ is the output of a circle node in the second layer; and x ⊗ y = ((x + 1)/2)·y as defined before.

Since the value of I is b₀ + Σ_{i∈{1,…,n}} wᵢuᵢ + Σ_{j∈{1,…,k}} w′ⱼsⱼ if the value of B is 1, and 0 if the value of B is −1, one may be tempted to construct the corresponding program expression as:

$$\text{if } \varphi\_B \text{ then } b\_0 + \sum\_{i \in \{1, \ldots, n\}} w\_i u\_i + \sum\_{j \in \{1, \ldots, k\}} w'\_j s\_j \text{ else } 0,$$

where φ_B is a Boolean expression corresponding to B. That is problematic, however, because we wish to construct an integer program expression, but the weights and bias (wᵢ, w′ⱼ, b₀) may be arbitrary floating-point numbers. We thus rescale the coefficients wᵢ, w′ⱼ, and b₀ as follows. We first pick integers c₀, c₁, …, cₙ and a real number r so that rb₀, rw₁, …, rwₙ are close to c₀, c₁, …, cₙ. For each w′ⱼ, we just pick an integer c′ⱼ close to w′ⱼ, and prepare the integer expression:

$$\text{if } \varphi\_B \text{ then } c\_0 + \sum\_{i \in \{1, \ldots, n\}} c\_i u\_i + \sum\_{j \in \{1, \ldots, k\}} c'\_j s\_j \text{ else } 0,$$

and use it as the integer-part of the function step.
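One simple way to perform this rescaling is sketched below (a heuristic sketch, not necessarily the exact procedure of the tool: it fixes the largest-magnitude coefficient to a small integer c and rounds the remaining scaled coefficients; it assumes at least one coefficient is nonzero):

```ocaml
(* Rescale real coefficients [b0; w1; ...; wn] to small integers:
   try r = c /. wmax for c = 1..5, round each r *. w to the nearest
   integer, and keep the (r, integers) pair with the least rounding error. *)
let rescale (ws : float list) : float * int list =
  let wmax = List.fold_left (fun m w -> max m (abs_float w)) 0.0 ws in
  let candidate c =
    let r = float_of_int c /. wmax in
    let cs = List.map (fun w -> int_of_float (Float.round (r *. w))) ws in
    let err =
      List.fold_left2
        (fun e w ci -> e +. abs_float (r *. w -. float_of_int ci)) 0.0 ws cs
    in
    (err, r, cs)
  in
  let best =
    List.fold_left (fun acc c -> min acc (candidate c)) (candidate 1) [2; 3; 4; 5]
  in
  let (_, r, cs) = best in (r, cs)
```

On the circle-node coefficients from Example 1 below, `rescale [-0.023; 1.128; -0.045]` picks r ≈ 1/1.128 and yields the integer coefficients `[0; 1; 0]`, i.e., the expression u.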

Before constructing Boolean expressions (including φ_B), we adjust (i) the hidden integer states in the run history of the RNN for the training data and (ii) the weights for the hidden integer nodes accordingly, to reflect the rescaling of the coefficients for computing hidden integer states: we multiply (i) by r, and divide (ii) by r. To see the need for the adjustment, let us recall the step function for sortedness:

$$step(u, r\_b, r\_i) = (r\_b \land (r\_i \le u), u).$$

The RNN may actually learn the following function:

$$step(u, r\_b, r\_i) = (r\_b \land (2r\_i \le u), 0.5u).$$

Suppose we have rescaled 0.5u to u, to make the coefficient an integer. That would increase the value of the hidden integer state by a factor of 2, so the coefficient of r_i in the inequality 2r_i ≤ u should be halved, to obtain r_i ≤ u. We can thus obtain

$$step(u, r\_b, r\_i) = (r\_b \land (r\_i \le u), u)$$

correctly.


Table 1. The value of each node of the FF-network for [2; 3; 5].

Example 1. As a concrete example, consider the synthesis of a sortedness predicate sorted, which takes a list ℓ and returns whether ℓ is sorted in ascending order. We set h = n = k = 1 and m = 0. The numbers of hidden nodes in the upper half of the second layer and in the upper half of the third layer were both set to 4. We trained the network using 200 positive samples (like ⇒ sorted([2; 3; 5])) and 94 negative samples (like sorted([9; 8]) ⇒). After the training, we re-ran the RNN on the training data and collected the value of each node of the FF-network. For example, for the data [2; 3; 5], we obtained the information shown on the left-hand side of Table 1. Here, the first group (separated by the horizontal line) shows the values of the nodes for the first element 2 of the list, and the second group shows those for the second element 3. We also look at the weights and biases of the FF-network to synthesize the target function step.

By inspecting the weights and bias for the circle node in the second layer, we can find that the function computed by the node is −0.023 + 1.128u − 0.045s, where u and s respectively denote the values of the integer input and the hidden integer state. The ratio between the constant and the coefficient of u is about 0 : 1, and the coefficient of s is close to 0. Thus, we set the integer expression to compute the next hidden integer state to if φ_B then u else 0, where the condition φ_B is yet to be synthesized.

The replacement of −0.023 + 1.128u − 0.045s with u multiplies the value of the hidden integer state by 1/1.128, as shown on the right-hand side of Table 1. The weights for the nodes in the second layer are also rescaled accordingly. ⊓⊔

It remains to construct Boolean expressions, consisting of linear inequalities on integer variables and Boolean variables. That can be achieved in a manner similar to [12]; we have, however, adopted the following procedure, which utilizes information about the value of each node in the FF network. In contrast, Kobayashi et al.'s method [12] uses only the weights and biases, in addition to the input and output for each training data; they did not utilize the values of internal nodes for each training data.

We synthesize linear inequalities corresponding to the diamond-shaped nodes in the second layer as follows. Let

$$\tanh(b\_0 + w\_1 u\_1 + \dots + w\_n u\_n + w\_{n+1} s\_1 + \dots + w\_{n+k} s\_k)$$

be the value computed by a diamond-shaped node in the second layer (where we assume that the weights w_{n+1}, …, w_{n+k} have already been rescaled). Let c₀, c₁, …, c_{n+k} be integers whose ratios are close to those of b₀, w₁, …, w_{n+k}. Then we set the corresponding inequality to

$$c\_0 + c\_1 u\_1 + \dots + c\_n u\_n + c\_{n+1} s\_1 + \dots + c\_{n+k} s\_k > e,$$

where e ∈ {−1, 0, 1} is chosen so that the truth value of the inequality best matches the actual input-output behavior of the node on the training data; recall the discussion in Section 3.1.
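The choice of e can be sketched as a small search over {−1, 0, 1}, scoring each candidate against the recorded node activations (a hedged sketch: `coeffs` are the rescaled integers [c₀; c₁; …], and the samples are hypothetical (inputs, activation) pairs, with an activation above 0 read as true):

```ocaml
(* Pick e in {-1; 0; 1} such that the truth value of
   c0 + c1*x1 + ... + cn*xn > e best matches the discretized activation
   of the node on the recorded samples. *)
let best_offset (coeffs : int list) (samples : (int list * float) list) : int =
  let value xs =
    match coeffs with
    | c0 :: cs -> c0 + List.fold_left2 (fun acc c x -> acc + c * x) 0 cs xs
    | [] -> 0
  in
  let score e =
    List.fold_left
      (fun n (xs, act) -> if (value xs > e) = (act > 0.0) then n + 1 else n)
      0 samples
  in
  List.fold_left (fun best e -> if score e > score best then e else best) (-1) [0; 1]
```

For instance, for the inequality u − s > e (coefficients [0; 1; −1]) and traces in which the node fires whenever u ≥ s, the search selects e = −1, matching the offset chosen in Example 2 below.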

Next, we construct Boolean functions corresponding to the diamond-shaped nodes in the fourth layer and the lower half of the third layer in Figure 2. This is performed by first constructing the truth tables for those functions based on the runs of the RNN on the training data, and then using a method for Boolean decision tree construction [7],<sup>4</sup> where Boolean variables and the inequalities synthesized above are used as qualifiers (i.e., atomic predicates that constitute Boolean functions). The qualifiers are prioritized based on the weights for the nodes in the third and fourth layers. The synthesized functions may not completely match the truth tables if appropriate inequalities have not been found in the previous step. Even so, we proceed to the next step to construct the step function and test it; recall that in our gray-box use of the neural network, the internal behavior of the synthesized program need not completely match that of the RNN.

Example 2. Recall Example 1. The next step is to synthesize linear inequalities from the (re-scaled) weights of the nodes in the second layer. After the re-scaling of weights, the functions computed by the diamond-shaped nodes are:

$$\tanh(1.396 + 0.876u + 1.182s) \qquad \tanh(1.066 + 1.084u - 1.052s) \qquad \cdots$$

Based on the ratios between the constant and coefficients, we synthesize linear inequalities of the form:

$$4 + 3u + 4s > e\_1 \qquad 1 + u - s > e\_2 \qquad -6 - 4u - 3s > e\_3 \qquad u - s > e\_4.$$

<sup>4</sup> Kobayashi et al. [12] suggested using the Quine–McCluskey method for this purpose, but we prefer Boolean decision tree construction for two reasons. First, the Quine–McCluskey method would not scale when the dimension is large. Second, we wish to give priorities to some qualifiers, as explained below.

We then check the rescaled trace information (such as the one in Table 1, but including the trace information for all the training data), and choose appropriate values for each eᵢ. In the present case, we obtain:

$$4 + 3u + 4s > 0 \qquad 1 + u - s > -1 \qquad -7 - 3u - 4s > 0 \qquad u - s > -1.$$

It remains to synthesize Boolean functions. To this end, for each diamond-shaped node in the fourth layer and in the lower half of the third layer, we construct a truth table, where the inputs are Boolean values obtained by discretizing the values of the diamond-shaped nodes in the first and second layers. For example, from Table 2, we obtain the following truth table for the diamond-shaped node in the fourth layer. Duplicated rows can be removed before the synthesis of a logical function.


Here, I₀ corresponds to the value of the hidden Boolean node, and I₁–I₄ correspond to the diamond-shaped nodes in the second layer, which represent the inequality constraints extracted above. We interpret values close to 1 (say, those greater than 0.5) as true, and those close to −1 (say, those less than −0.5) as false, ignoring the other values.

Once a truth table has been constructed, we can apply a classical method to synthesize a logical function that conforms to the truth table. In our implementation, we have employed a technique for Boolean decision tree construction; instead of computing the entropy [7], however, we have prioritized the Boolean inputs (I₀–I₄ in the above case) based on the weights for the nodes in the third and fourth layers, which indicate which Boolean inputs affect the output node.
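A minimal sketch of such a prioritized decision-tree construction is shown below (a simplified illustration, not the tool's exact algorithm: rows pair discretized Boolean inputs with the node's Boolean output, and `priority` lists input indices from most to least influential):

```ocaml
type tree = Leaf of bool | Node of int * tree * tree  (* (index, if-true, if-false) *)

(* Split on the inputs in the given priority order; stop as soon as all
   remaining rows agree on the output. *)
let rec build (priority : int list) (rows : (bool list * bool) list) : tree =
  match rows with
  | [] -> Leaf true
  | (_, o) :: _ when List.for_all (fun (_, o') -> o' = o) rows -> Leaf o
  | _ ->
    (match priority with
     | [] ->  (* out of qualifiers: fall back to the majority output *)
       Leaf (2 * List.length (List.filter snd rows) >= List.length rows)
     | i :: rest ->
       let t, f = List.partition (fun (ins, _) -> List.nth ins i) rows in
       Node (i, build rest t, build rest f))

let rec eval (tr : tree) (ins : bool list) : bool =
  match tr with
  | Leaf b -> b
  | Node (i, t, f) -> if List.nth ins i then eval t ins else eval f ins
```

For example, on the truth table of a conjunction of two inputs, `build [0; 1]` produces a tree whose `eval` reproduces the table exactly.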

Suppose that the logical function O = I₀ ∧ I₄ has been synthesized in the above example. Suppose also that the constant function true has been synthesized for the diamond-shaped node in the third layer. Since I₀ corresponds to the hidden Boolean state, and I₄ corresponds to the inequality u − s > −1 (which is equivalent to s ≤ u), we obtain

$$step(u, r\_b, s) = (r\_b \land (s \le u), \text{if } \text{true then } u \text{ else } 0),$$

as the step function. ⊓⊔

By combining the procedures above, we can construct the function step. After constructing the step function, we test the synthesized recursive function against training data, and check whether the outputs of the synthesized function satisfy all the implication constraints. If some constraints are not satisfied, we re-train the RNN and repeat the synthesis procedure above. To avoid the re-training of the RNN from scratch, however, we first fix the part for computing the hidden integer states. This is because the process of re-scaling the parameters for the hidden integer states as explained above is costly and error-prone. Upon repeated failures, however, we reset all the parameters of the RNN and re-train it from scratch.

## 5 Implementation and Experiments

We have implemented a tool called NeuGuSR for the synthesis of recursive predicates based on the method described above. It is written in OCaml, using the machine learning framework ocaml-torch (https://github.com/LaurentMazare/ocaml-torch), an OCaml interface to the PyTorch library. Our tool is available at https://github.com/naokikob/neugusR. This section describes the experiments we conducted to confirm the effectiveness of our approach.

All the experiments below were conducted on a laptop computer with an Intel(R) Core(TM) i5-8265U CPU (1.60 GHz) and 8 GB of memory. Training was done using only the CPU.

#### 5.1 Dataset and predicates

We have prepared 11 recursive predicates over integer lists and integers for synthesis. Examples include the predicate max(l, n), which says that the largest element of l is n, and sumle(l₁, l₂), which says that the sum of l₁ is less than or equal to the sum of l₂, as well as the predicates sorted(l) and avge(l, n) already described in Section 1.

For the experiments, we consider positive constraints (of the form ⇒ a, where a ∈ Atoms_K and K is the corresponding signature of the predicate) and negative constraints (of the form a ⇒), as well as general implication constraints as defined in Section 2.

For each problem (predicate), we performed 3 runs to see if the solver was able to synthesize a program that matches all the training examples. We set the time limit of each run to 1200 seconds. In each run, the neural network is trained for 30000 steps by default. At each step, all the training examples of the predicate are used to optimize the neural network. In each run, training is terminated early if the accuracy reaches 100% on the training examples and the loss falls below a threshold, which in the current setting is 10⁻⁶.

If the accuracy did not reach 100% within 30000 steps, or some constraints were not satisfied by the synthesized program, training was restarted with fresh parameters, except for the weights of the hidden integer states. After three consecutive failures of convergence, however, we reset all the parameters and restarted training from scratch.

We used the Adam optimizer [11] for training with the default settings of ocaml-torch (β₁ = 0.9, β₂ = 0.999, no weight decay) and a learning rate of 0.001. Learned parameters are not shared between different problems.

### 5.2 Evaluation

The specification of the RNN used for each problem is as follows. For all the predicates other than updown and max, we used 4 nodes for the second layer of the RNN, 8 nodes for the third layer, and 1 node each for the integer and Boolean hidden states. For updown and max, we used 2 nodes for the Boolean hidden states and 16 nodes for the third layer. For max, we also used 8 nodes for the second layer instead of 4.

We report the performance of our tool NeuGuSR with respect to the following metrics.


Table 2 shows the performance of NeuGuSR for each predicate. It can be seen that NeuGuSR was able to solve all the problems consistently, with the only exception of max, which failed once due to a timeout. The small number of retries triggered during the synthesis of each predicate suggests that our approach is effective. Our RNN was able to classify the positive and negative examples very well, because otherwise multiple restarts of training would have been forced even before entering the extraction phase. Our extraction procedure was also reasonably accurate: while errors could occur, they were quickly fixed within a few retries (3 on average, as can be seen in Table 2).


Table 2. Performance on the predicates to be synthesized.

The predicate max is the only one among the 11 predicates that involves equality, which probably explains why it is the most difficult. That max could be synthesized at all came as something of a surprise and demonstrates the generality of our approach to some extent. While our framework was not designed specifically to handle equalities, the neural network, if lucky, might still find clever ways to express equalities using inequalities. This is one of the reasons we specified 8 nodes for the second layer when dealing with max: the more inequalities we have, the more likely a combination of them happens to express a certain equality.

Remark 2. We could not find any previous tool that can be directly compared with ours. A possible alternative approach to our synthesis problem would be to prepare a template for the step function, generate constraints on parameters in the template, and use an SMT solver to solve them.

## 6 Related Work

As already mentioned, the present work may be considered an extension of Kobayashi et al.'s NeuGuS framework [12], where feed-forward neural networks are used as gray-boxes to synthesize formulas of quantifier-free linear integer arithmetic. We have significantly expanded the scope of NeuGuS, by enabling the synthesis of recursive predicates on lists; to that end, we have employed special-purpose recursive neural networks.

Our work has been partially motivated by Shimoda et al.'s work on an extension of symbolic automata called symbolic automatic relations (SARs) [19]. They introduced SARs to express recursive predicates on lists, and used them to express loop invariants on lists (more precisely, to express candidate solutions for the CHC satisfiability problem [1]) for automated verification of list-manipulating programs. They left it to future work how to automatically infer SARs from positive, negative, and implication constraints. Our work fills that gap, since the class of programs synthesized in our framework corresponds to their SARs (more precisely, Σ<sup>sar</sup><sub>1</sub>-formulas [19]). Further refinement and optimizations would, however, be required for our tool to be effectively used in that context.

Our work is also related to neural network-based approaches to the synthesis of finite automata [16,21]. Our method deals with a much wider class of programs involving integers and integer lists. Also, the problem setting is slightly different; Weiss et al.'s method [21] takes a trained RNN as the ground truth, and aims to construct an automaton whose behavior matches that of the RNN. In contrast, in our approach, we allow the behavior of the synthesized program and that of the RNN to be different for inputs other than those given as training data. This is because in the NeuGuS framework, the trained RNN is supposed to be used just as a hint, and does not necessarily provide the ground truth. The ground truth is determined from the whole verification or synthesis goal [12,13], as discussed at the end of Section 2. In the context of program verification, the synthesized predicate is used as a candidate program invariant, and it is checked whether it is indeed an inductive invariant; if not, then new training data are added and

NeuGuS should be repeated. In the context of oracle-based program synthesis, the synthesized function is used as a component of the whole program, and then it is checked whether the whole program satisfies a specification; if not, then new training data for the function are generated and NeuGuS should be repeated. Recently, the above line of work has also been further extended to infer weighted automata [22,15] and context-free grammars [23], which are incompatible with the class of programs synthesized by our method.

There have been studies of other approaches to program synthesis based on neural networks, most notably those based on transformers [3,18,17]. Both the problem settings and the approaches (the ways in which neural networks are used) are quite different between those studies and our work. Our goal is to synthesize programs from positive/negative/implication constraints (where those constraints are added as necessary in the whole loop of program verification or synthesis), and it is not clear to us how to effectively apply transformer-based approaches to program synthesis for that purpose. While transformer-based approaches can in principle be used for our program synthesis problem, a huge amount of training data (consisting of pairs of positive/negative/implication constraints and a program that satisfies the constraints) would be required, and they might not work well for the synthesis of unseen programs. Other neural network-based approaches include that of AlphaTensor [5], which used deep reinforcement learning to discover new matrix multiplication algorithms.

The synthesis of predicates from positive/negative samples (but without implication constraints) is an instance of the well-studied problem of programming by examples (PBE). PBE has been especially successful in the synthesis of string-to-string functions in DSLs [8], and machine learning has also recently been applied [20]. To our knowledge, however, the synthesis of recursive functions has not been much studied in that context.

## 7 Conclusion

We have proposed a novel approach to automated synthesis of recursive predicates on lists, as an extension of Kobayashi et al.'s neural-network-guided synthesis (NeuGuS) [12]. We have designed a special-purpose recursive neural network and devised a method to synthesize a recursive predicate by using the trained network as a hint. We have implemented a synthesis tool based on the method and confirmed that the tool works reasonably well for various examples. We plan to further refine the tool and deploy it in the context of automated verification of list-manipulating programs [19] and oracle-based program synthesis [13]. We also plan to extend the method to enable the synthesis of a larger class of recursive programs, including more general list-processing programs that go beyond the "fold" functions, and tree-processing programs.

#### Acknowledgments

We would like to thank anonymous referees for useful comments. This work was supported by JSPS KAKENHI Grant Number JP20H05703.

## References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/ 4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Automata**

## **Modular Mix-and-Match Complementation of Büchi Automata**

Vojtěch Havlena<sup>1(B)</sup>, Ondřej Lengál<sup>1(B)</sup>, Yong Li<sup>2,3(B)</sup>, Barbora Šmahlíková<sup>1(B)</sup>, and Andrea Turrini<sup>3,4(B)</sup>

1 Faculty of Information Technology, Brno University of Technology, Brno, Czech Republic ihavlena@fit.vut.cz, lengal@vut.cz, xsmahl00@vut.cz

2 Department of Computer Science, University of Liverpool, Liverpool, UK liyong@liverpool.ac.uk

3 State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing, People's Republic of China turrini@ios.ac.cn

4 Institute of Intelligent Software, Guangzhou, Guangzhou, People's Republic of China

**Abstract.** Complementation of nondeterministic Büchi automata (BAs) is an important problem in automata theory with numerous applications in formal verification, such as termination analysis of programs, model checking, or in decision procedures of some logics. We build on ideas from a recent work on BA determinization by Li *et al.* and propose a new modular algorithm for BA complementation. Our algorithm allows one to combine several BA complementation procedures together, with one procedure for a subset of the BA's strongly connected components (SCCs). In this way, one can exploit the structure of particular SCCs (such as when they are inherently weak or deterministic) and use more efficient specialized algorithms, regardless of the structure of the whole BA. We give a general framework into which partial complementation procedures can be plugged in, and its instantiation with several algorithms. The framework can, in general, produce a complement with an Emerson-Lei acceptance condition, which can often be more compact. Using the algorithm, we were able to establish an exponentially better new upper bound of O(4<sup>n</sup>) for complementation of the recently introduced class of elevator automata. We implemented the algorithm in a prototype and performed a comprehensive set of experiments on a large set of benchmarks, showing that our framework complements well the state of the art and that it can serve as a basis for future efficient BA complementation and inclusion checking algorithms.

## **1 Introduction**

Nondeterministic Büchi automata (BAs) [8] are an elegant and conceptually simple framework to model infinite behaviors of systems and the properties they are expected to satisfy. BAs are widely used in many important verification tasks, such as termination analysis of programs [30], model checking [54], or as the underlying formal model of decision procedures for some logics (such as S1S [8] or a fragment of the first-order logic over Sturmian words [31]). Many of these applications require performing *complementation* of BAs: For instance, in termination analysis of programs within Ultimate Automizer [30], complementation is used to keep track of the set of paths whose termination still needs to be proved. On the other hand, in model checking<sup>5</sup> and decision

© The Author(s) 2023 S. Sankaranarayanan and N. Sharygina (Eds.): TACAS 2023, LNCS 13993, pp. 249–270, 2023. https://doi.org/10.1007/978-3-031-30823-9_13

<sup>5</sup> Here, we consider model checking w.r.t. a specification given in some more expressive logic, such as S1S [8], QPTL [50], or HyperLTL [12], rather than LTL [44], where negation is simple.

procedures of logics, complementation is usually used to implement negation and quantifier alternation. Complementation is often the most difficult automata operation performed here; its worst-case state complexity is O((0.76n)<sup>n</sup>) [48,2] (which is tight [55]).

In these applications, the efficiency of the complementation often determines the overall efficiency (or even feasibility) of the top-level application. For instance, the success of Ultimate Automizer in the Termination category of the International Competition on Software Verification (SV-COMP) [51] is to a large degree due to an efficient BA complementation algorithm [6,11] tailored for BAs with a special structure that it often encounters (as of the time of writing, it has won 6 gold medals in the years 2017–2022 and two silver medals in 2015 and 2016). The special structure in this case are the so-called *semi-deterministic BAs* (SDBAs), BAs consisting of two parts: (i) an initial part without accepting states/transitions and (ii) a deterministic part containing accepting states/transitions that cannot transition into the first part.

Complementation of SDBAs using one of the family of the so-called NCSB algorithms [6,5,11,28] has the worst-case complexity O(4<sup>n</sup>) (and usually also works much better in practice than general BA complementation procedures). Similarly, there are efficient complementation procedures for other subclasses of BAs, e.g., (i) *deterministic BAs* (DBAs) can be complemented into BAs with 2n states [35] (or into co-Büchi automata with n + 1 states) or (ii) *inherently weak BAs* (BAs where in each *strongly connected component* (SCC), either all cycles are accepting or all cycles are rejecting) can be complemented into DBAs with O(3<sup>n</sup>) states using the Miyano-Hayashi algorithm [42].

For a long time, there had been no efficient algorithm for complementation of BAs that are highly structured but do not fall into one of the categories above, e.g., BAs containing inherently weak, deterministic, and some nondeterministic SCCs. For such BAs, one needed to use a general complementation algorithm with the O((0.76n)<sup>n</sup>) (or worse) complexity. To the best of our knowledge, only recently have works appeared that exploit the structure of BAs to obtain a more efficient complementation algorithm: (i) The work of Havlena *et al.* [29], who introduce the class of *elevator automata* (BAs with an arbitrary mixture of inherently weak and deterministic SCCs) and give an O(16<sup>n</sup>) algorithm for them. (ii) The work of Li *et al.* [37], who propose a BA determinization procedure (into a deterministic Emerson-Lei automaton) that is based on decomposing the input BA into SCCs and using a different determinization procedure for different types of SCCs (inherently weak, deterministic, general) in a synchronous construction.

In this paper, we propose a new BA complementation algorithm inspired by [37], where we exploit the fact that complementation is, in a sense, more relaxed than determinization. In particular, we present a *framework* where one can plug in different partial complementation procedures fine-tuned for SCCs with a specific structure. The procedures work only with the given SCCs, to some degree *independently* (thus reducing the potential state space explosion) of the rest of the BA. Our top-level algorithm then orchestrates runs of the different procedures in a *synchronous* manner (or completely independently in the so-called *postponed* strategy), obtaining a resulting automaton with a potentially more general acceptance condition (in general an Emerson-Lei condition), which can help keep the result small. If the procedures satisfy given correctness requirements, our framework guarantees that its instantiation will also be correct. We also propose optimizations, e.g., using round-robin to decrease the amount of nondeterminism, using a shared breakpoint to reduce the size and the number of colours for a certain class of partial algorithms, and generalizing simulation-based pruning of macrostates.

We provide a detailed description of partial complementation procedures for inherently weak, deterministic, and initial deterministic SCCs, which we use to obtain a *new* exponentially better upper bound of O(4<sup>n</sup>) for the class of elevator automata (i.e., the same upper bound as for its strict subclass of SDBAs). Furthermore, we also provide two partial procedures for general SCCs based on determinization (from [37]) and the rank-based construction. Using a prototype implementation, we then show that our algorithm complements well existing approaches and significantly improves the state of the art.

## **2 Preliminaries**

We fix a finite non-empty alphabet Σ and the first infinite ordinal ω. An (infinite) word w is a function w: ω → Σ, where the i-th symbol of w is denoted as w<sub>i</sub>. Sometimes, we represent w as an infinite sequence w = w<sub>0</sub>w<sub>1</sub> . . . We denote the set of all infinite words over Σ as Σ<sup>ω</sup>; an *ω-language* is a subset of Σ<sup>ω</sup>.

*Emerson-Lei Acceptance Conditions.* Given a set Γ = {0, . . . , k−1} of *colours* (often depicted as **0** , **1** , etc.), we define the set of *Emerson-Lei acceptance conditions* EL(Γ) as the set of formulae constructed according to the following grammar:

$$\alpha \coloneqq \mathsf{Inf}(c) \mid \mathsf{Fin}(c) \mid (\alpha \land \alpha) \mid (\alpha \lor \alpha) \tag{1}$$

for c ∈ Γ. The *satisfaction* relation |= for a set of colours M ⊆ Γ and a condition α is defined inductively as follows (for c ∈ Γ):

$$\begin{aligned} M &\models \mathsf{Fin}(c) \;\text{ iff }\; c \notin M, & M &\models \alpha_1 \lor \alpha_2 \;\text{ iff }\; M \models \alpha_1 \text{ or } M \models \alpha_2, \\ M &\models \mathsf{Inf}(c) \;\text{ iff }\; c \in M, & M &\models \alpha_1 \land \alpha_2 \;\text{ iff }\; M \models \alpha_1 \text{ and } M \models \alpha_2. \end{aligned}$$
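As a concrete reading of the satisfaction relation above, the following is a minimal evaluator sketch for EL conditions; the nested-tuple encoding of formulae is our own illustration, not notation from the paper.

```python
def sat(M, alpha):
    """Decide M |= alpha for an Emerson-Lei condition encoded as nested
    tuples: ("Inf", c), ("Fin", c), ("and", a1, a2), ("or", a1, a2)."""
    op = alpha[0]
    if op == "Inf":                     # M |= Inf(c)  iff  c in M
        return alpha[1] in M
    if op == "Fin":                     # M |= Fin(c)  iff  c not in M
        return alpha[1] not in M
    if op == "and":
        return sat(M, alpha[1]) and sat(M, alpha[2])
    if op == "or":
        return sat(M, alpha[1]) or sat(M, alpha[2])
    raise ValueError(f"unknown operator: {op}")
```

For instance, the condition Inf(0) ∧ Inf(1) used later for the complement in Example 1 is satisfied exactly by the colour sets containing both 0 and 1.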

*Emerson-Lei Automata.* A (nondeterministic transition-based<sup>6</sup>) *Emerson-Lei automaton* (TELA) over Σ is a tuple A = (Q, δ, I, Γ, p, Acc), where Q is a finite set of *states*, δ ⊆ Q × Σ × Q is a set of *transitions*<sup>7</sup>, I ⊆ Q is the set of *initial* states, Γ is the set of *colours*, p: δ → 2<sup>Γ</sup> is a *colouring function* of transitions, and Acc ∈ EL(Γ). We use q →<sub>a</sub> r to denote that (q, a, r) ∈ δ and sometimes also treat δ as a function δ: Q × Σ → 2<sup>Q</sup>. Moreover, we extend δ to sets of states S ⊆ Q as δ(S, a) = ⋃<sub>q∈S</sub> δ(q, a). We use A[q] for q ∈ Q to denote the automaton A[q] = (Q, δ, {q}, Γ, p, Acc), i.e., the TELA obtained from A by setting q as the only initial state. A is called *deterministic* if |I| ≤ 1 and |δ(q, a)| ≤ 1 for each q ∈ Q and a ∈ Σ. If Γ = { **0** } and Acc = Inf( **0** ), we call A a *Büchi automaton* (BA) and denote it as A = (Q, δ, I, F), where F is the set of all transitions coloured by **0** , i.e., F = p<sup>−1</sup>({ **0** }). For a BA, we use δ<sub>F</sub>(q, a) = {r ∈ δ(q, a) | p(q →<sub>a</sub> r) = { **0** }} (and extend the notation to sets of states as for δ). A BA A = (Q, δ, I, F) is called *semi-deterministic* (SDBA) if for every accepting transition (q →<sub>a</sub> r) ∈ F, the reachable part of A[r] is deterministic.

A *run* of A from q ∈ Q on an input word w is an infinite sequence ρ: ω → Q that starts in q and respects δ, i.e., ρ<sub>0</sub> = q and ∀i ≥ 0: ρ<sub>i</sub> →<sub>w<sub>i</sub></sub> ρ<sub>i+1</sub> ∈ δ. Let inf(ρ) ⊆ δ denote the set of transitions occurring in ρ infinitely often and inf<sub>Γ</sub>(ρ) = ⋃ {p(t) | t ∈

<sup>6</sup> We only consider transition-based acceptance in order to avoid cluttering the paper by always dealing with accepting states *and* accepting transitions. Extending our approach to state/transition-based (or just state-based) automata is straightforward.

<sup>7</sup> Note that some authors use a more general definition of TELAs with δ ⊆ Q × Σ × 2<sup>Γ</sup> × Q; we only use them as the output of our algorithm, where the simpler definition suffices.

inf(ρ)} be the set of infinitely often occurring colours. A run ρ is *accepting* in A if inf<sub>Γ</sub>(ρ) |= Acc, and the *language* of A, denoted as L(A), is defined as the set of words w ∈ Σ<sup>ω</sup> for which there exists an accepting run of A starting in some state of I.

Consider a BA A = (Q, δ, I, F). For a set of states S ⊆ Q, we use A<sub>S</sub> to denote the copy of A where accepting transitions only occur between states from S, i.e., the BA A<sub>S</sub> = (Q, δ, I, F ∩ δ|<sub>S</sub>) where δ|<sub>S</sub> = {q →<sub>a</sub> r ∈ δ | q, r ∈ S}. We say that a non-empty set of states C ⊆ Q is a *strongly connected component* (SCC) if every pair of states of C can reach each other and C is a maximal such set. An SCC of A is *trivial* if it consists of a single state that does not contain a self-loop and *non-trivial* otherwise. An SCC C is *accepting* if it contains at least one accepting transition and *inherently weak* if either (i) every cycle in C contains a transition from F or (ii) no cycle in C contains any transition from F. An SCC C is *deterministic* if the BA (C, δ|<sub>C</sub>, {q}, ∅) for any q ∈ C is deterministic. We denote inherently weak components as IWCs, accepting deterministic components that are not inherently weak as DACs (deterministic accepting), and the remaining accepting components as NACs (nondeterministic accepting). A BA A is called an *elevator automaton* if it contains no NAC.
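To make the IWC/DAC/NAC classification concrete, here is a hedged sketch that classifies one SCC given explicitly. It relies on the observation that "every cycle contains an accepting transition" holds iff the SCC restricted to non-accepting transitions is acyclic. The triple encoding of transitions is our own illustration, not the paper's.

```python
def classify_scc(states, trans, acc):
    """Classify an SCC. `trans` is the set of (q, a, r) transitions inside
    the SCC and `acc` its accepting subset. Returns "IWC", "DAC", or "NAC"
    (a non-accepting SCC is reported as "IWC" here)."""
    def has_cycle(edges):
        # standard white/gray/black DFS cycle detection
        graph = {q: [] for q in states}
        for (q, _, r) in edges:
            graph[q].append(r)
        WHITE, GRAY, BLACK = 0, 1, 2
        colour = {q: WHITE for q in states}
        def dfs(q):
            colour[q] = GRAY
            for r in graph[q]:
                if colour[r] == GRAY or (colour[r] == WHITE and dfs(r)):
                    return True
            colour[q] = BLACK
            return False
        return any(colour[q] == WHITE and dfs(q) for q in states)

    if not acc:
        return "IWC"          # no cycle sees an accepting transition
    if not has_cycle(trans - acc):
        return "IWC"          # every cycle sees an accepting transition
    deterministic = all(
        len({r for (q2, a2, r) in trans if (q2, a2) == (q, a)}) <= 1
        for (q, a, _) in trans)
    return "DAC" if deterministic else "NAC"
```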

We assume that A contains no accepting transition outside its SCCs (no run can cycle over such transitions). We use δ<sub>SCC</sub> to denote the restriction of δ to transitions that do not leave their SCCs, formally, δ<sub>SCC</sub> = {q →<sub>a</sub> r ∈ δ | q and r are in the same SCC}. A *partition block* P ⊆ Q of A is a nonempty union of its accepting SCCs, and a *partitioning* of A is a sequence P<sub>1</sub>, . . . , P<sub>n</sub> of pairwise disjoint partition blocks of A that contains all accepting SCCs of A. Given a P, let A<sub>P</sub> be the BA obtained from A by removing colours from transitions outside P. The following fact serves as the basis of our decomposition-based complementation procedure.

**Fact 1.** L(A) = L(A<sub>P<sub>1</sub></sub>) ∪ . . . ∪ L(A<sub>P<sub>n</sub></sub>)

The complement (automaton) of a BA A is a TELA that accepts the complement language Σ<sup>ω</sup> \ L(A) of L(A). In the paper, we call a state and a run of a complement automaton a *macrostate* and a *macrorun*, respectively.

## **3 A Modular Complementation Algorithm**

In a nutshell, the main idea of our BA complementation algorithm is to first decompose a BA A into several partition blocks according to their properties, and then perform complementation for each of the partition blocks (potentially using a different algorithm) independently, using either a *synchronous* construction, synchronizing the complementation algorithms for all partition blocks in each step, or a *postponed* construction, which complements the partition blocks independently and combines the partial results using the automata product construction. The decomposition of A into partition blocks can either be trivial, i.e., with one block for each accepting SCC, or more elaborate, e.g., a partitioning where one partition block contains all accepting IWCs, another contains all DACs, and each NAC is given its own partition block. In this way, one can avoid running a general complementation algorithm for unrestricted BAs with the state complexity upper bound O((0.76n)<sup>n</sup>) and, instead, apply the most suitable complementation procedure for each of the partition blocks. This comes with three main advantages:

1. The complementation algorithm for each partition block can be selected differently in order to exploit the properties of the block. For instance, for partition blocks with IWCs, one can use complementation based on the breakpoint (the so-called Miyano-Hayashi) construction [42] with O(3<sup>n</sup>) macrostates (cf. Sec. 4.1), while for partition blocks with only DACs, one can use an algorithm with the state complexity O(4<sup>n</sup>) based on an adaptation of the NCSB construction [6,5,11,28] for SDBAs (cf. Sec. 4.2). For NACs, one can choose between, e.g., rank- [34,21,48,10,24,29] or determinization-based [46,43,45] algorithms, depending on the properties of the NACs (cf. Sec. 6).


Those partial complementation algorithms then need to be orchestrated by a top-level algorithm to produce the complement of A.

One might regard our algorithm as an optimization of an approach that would, for each partition block P, obtain a BA A<sub>P</sub>, complement A<sub>P</sub> using the selected algorithm, and perform the intersection of all obtained complements (which would, however, not be able to achieve the upper bound for elevator automata that we give in Sec. 4.3). Indeed, we also implemented the mentioned procedure (called the *postponed* approach, described in Sec. 5.2) and compared it to our main procedure (called the *synchronous* approach).

#### **3.1 Basic Synchronous Algorithm**

In this section, we describe the basic *synchronous* top-level algorithm. Then, in Sec. 4, we provide its instantiation for elevator automata and give a new upper bound for their complementation; in Sec. 5, we discuss several optimizations of the algorithm; and in Sec. 6, we give a generalization for unrestricted BAs. Let us fix a BA A = (Q, δ, I, F) and, w.l.o.g., assume that A is *complete*<sup>8</sup>, i.e., |I| > 0 and all states q ∈ Q have an outgoing transition over every symbol a ∈ Σ.

The synchronous algorithm works with partial complementation algorithms for the BA's partition blocks. Each such algorithm Alg is provided with a structural condition φ<sub>Alg</sub> characterizing the partition blocks it can complement. For a BA B, we use the notation B |= φ to denote that B satisfies the condition φ. We say that Alg is a *partial complementation algorithm for a partition block* P if A |=<sub>P</sub> φ<sub>Alg</sub>. We distinguish between Alg, a general algorithm able to complement a partition block of a given type, and Alg<sub>P</sub>, its instantiation for the partition block P. Each instance Alg<sub>P</sub> is required to provide the following:


**–** T<sup>Alg<sub>P</sub></sup> — the type of macrostates, Init<sup>Alg<sub>P</sub></sup> ⊆ T<sup>Alg<sub>P</sub></sup> — the set of initial macrostates, and Colours<sup>Alg<sub>P</sub></sup> — the set of colours;

**–** Succ<sup>Alg<sub>P</sub></sup>(H, M, a) — the successor function, where H ⊆ Q is the set of all states of A reached over the input read so far, M is a macrostate for the given partition block, a is the input symbol, and each (M′, α) in the result is a pair (*macrostate*, *set of colours*) such that M′ is a successor of M over a w.r.t. H and α is a set of colours on the edge from M to M′ (H helps to keep track of *new* runs coming into the partition block); and

**–** Acc<sup>Alg<sub>P</sub></sup> ∈ EL(Colours<sup>Alg<sub>P</sub></sup>) — the acceptance condition.

Let P<sub>1</sub>, . . . , P<sub>n</sub> be a partitioning of A (w.l.o.g., we assume that n > 0), and Alg<sup>1</sup>, . . . , Alg<sup>n</sup> be a sequence of algorithms such that Alg<sup>i</sup> is a partial complementation algorithm for P<sub>i</sub>. Furthermore, let us define the auxiliary *renumbering* function λ as λ(c, i) = c + Σ<sub>j=1</sub><sup>i−1</sup> |Colours<sup>Alg<sup>j</sup></sup>|, which is used to make the colours and acceptance conditions from the partial complementation algorithms disjoint. We also lift λ to sets of colours in the natural way, and also to EL conditions such that λ(α, i) has the same structure as α but each atom Inf(c) is substituted with the atom Inf(λ(c, i)) (and likewise for Fin atoms). The synchronous complementation algorithm then produces the TELA ModCompl(Alg<sup>1</sup><sub>P<sub>1</sub></sub>, . . . , Alg<sup>n</sup><sub>P<sub>n</sub></sub>, A) = (Q<sup>C</sup>, δ<sup>C</sup>, I<sup>C</sup>, Γ<sup>C</sup>, p<sup>C</sup>, Acc<sup>C</sup>) with components defined as follows (we use [X<sub>i</sub>]<sub>i=1</sub><sup>n</sup> to abbreviate X<sub>1</sub> × · · · × X<sub>n</sub>):

$$\begin{aligned} &- \; Q^{\mathcal{C}} = 2^{Q} \times [\mathtt{T}^{\mathsf{Alg}^i_{P_i}}]_{i=1}^{n}, & &- \; \Gamma^{\mathcal{C}} = \{0, \dots, \textstyle\sum_{i=1}^{n} |\mathtt{Colours}^{\mathsf{Alg}^i_{P_i}}| - 1\}, \\ &- \; I^{\mathcal{C}} = \{I\} \times [\mathtt{Init}^{\mathsf{Alg}^i_{P_i}}]_{i=1}^{n}, & &- \; \mathsf{Acc}^{\mathcal{C}} = \textstyle\bigwedge_{i=1}^{n} \lambda(\mathsf{Acc}^{\mathsf{Alg}^i_{P_i}}, i), \text{ and} \end{aligned}$$

**–** δ<sup>C</sup> and p<sup>C</sup> are defined such that if

$$((M'_1, \alpha_1), \dots, (M'_n, \alpha_n)) \in [\mathtt{Succ}^{\mathsf{Alg}^i_{P_i}}(H, M_i, a)]_{i=1}^{n},$$

then δ<sup>C</sup> contains the transition t: (H, M<sub>1</sub>, . . . , M<sub>n</sub>) →<sub>a</sub> (δ(H, a), M′<sub>1</sub>, . . . , M′<sub>n</sub>), coloured by p<sup>C</sup>(t) = ⋃ {λ(α<sub>i</sub>, i) | 1 ≤ i ≤ n}, and δ<sup>C</sup> is the smallest such set.
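The renumbering function λ can be sketched directly; here `colour_counts[j]` stands for the number of colours of the (j+1)-st partial algorithm, an encoding assumed only for this illustration.

```python
def renumber(c, i, colour_counts):
    """λ(c, i) = c + Σ_{j=1}^{i-1} |Colours^{Alg^j}|: shift colour c of the
    i-th partial algorithm past all colours of the preceding algorithms,
    so that the colour ranges of the blocks are pairwise disjoint."""
    return c + sum(colour_counts[:i - 1])   # blocks are 1-indexed
```

For two blocks with one colour each, colour 0 of the second block is renumbered to 1, so the conjunction Acc<sup>C</sup> can refer to both blocks' colours without clashes.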

In order for ModCompl to be correct, the partial complementation algorithms need to satisfy certain properties, which we discuss below.

For a structural condition φ and a BA B = (Q, δ, I, F), we define B |=<sub>P</sub> φ if B |= φ, P is a partition block of B, and B contains no accepting transitions outside P. We can now provide the correctness condition on Alg.

**Definition 1.** *We say that* Alg *is* correct *if for each BA* B *and partition block* P *such that* B |=<sub>P</sub> φ<sub>Alg</sub> *it holds that* L(*ModCompl*(Alg<sub>P</sub>, B)) = Σ<sup>ω</sup> \ L(B)*.*

The correctness of the synchronous algorithm (provided that each partial complementation algorithm is correct) is then established by Theorem 1.

**Theorem 1.** *Let* A *be a BA,* P<sub>1</sub>, . . . , P<sub>n</sub> *be a partitioning of* A*, and* Alg<sup>1</sup>, . . . , Alg<sup>n</sup> *be a sequence of partial complementation algorithms such that each* Alg<sup>i</sup> *is* correct *for* P<sub>i</sub>*. Then, we have* L(*ModCompl*(Alg<sup>1</sup><sub>P<sub>1</sub></sub>, . . . , Alg<sup>n</sup><sub>P<sub>n</sub></sub>, A)) = Σ<sup>ω</sup> \ L(A)*.*

## **4 Modular Complementation of Elevator Automata**

In this section, we first give partial algorithms to complement partition blocks with only accepting IWCs (Sec. 4.1) and partition blocks with only DACs (Sec. 4.2). Then, in Sec. 4.3, we show that using our algorithm, the upper bound on the size of the complement of elevator BAs is in O(4<sup>n</sup>), which is *exponentially better* than the known upper bound O(16<sup>n</sup>) established in [29].

<sup>8</sup> If we drop the condition that A is complete, we also need to add an *accepting sink state* (representing the case for H = ∅) with self-loops over all symbols marked by a new colour c, and enrich Acc<sup>C</sup> with . . . ∨ Inf(c).

#### **4.1 Complementation of Inherently Weak Accepting Components**

First, we introduce a partial algorithm MH with the condition φ<sub>MH</sub> specifying that all SCCs in the partition block are *accepting* IWCs. Let P be a partition block of A such that A |=<sub>P</sub> φ<sub>MH</sub>. Our proposed approach makes use of the Miyano-Hayashi construction [42]. Since in accepting IWCs all runs are accepting, the idea of the construction is to accept words such that all runs over the words eventually leave P.

Therefore, we use a pair (C, B) of sets of states as a macrostate for complementing P. Intuitively, we use C to denote the set of all runs of A that are in P (C for "*check*"). The set B ⊆ C represents the runs being inspected as to whether they leave P at some point (B for "*breakpoint*"). Initially, we let C = I ∩ P and also sample into the breakpoint all runs in C, i.e., set B = C. Along reading an ω-word w, if all runs that have entered P eventually leave P, i.e., B becomes empty infinitely often, the complement language of A<sub>P</sub> should contain w (when B becomes empty, we sample into B all runs from the current C). We formalize MH as a partial procedure in the framework from Sec. 3.1 as follows:

**–** T<sup>MH</sup> = 2<sup>Q</sup> × 2<sup>Q</sup>, Colours<sup>MH</sup> = { **0** }, Init<sup>MH</sup> = {(I ∩ P, I ∩ P)},
**–** Acc<sup>MH</sup> = Inf( **0** ), and Succ<sup>MH</sup>(H, (C, B), a) = {((C′, B′), c)}, where
	- • C′ = δ(H, a) ∩ P,
	- • B′ = C′ if B<sup>★</sup> = ∅ for B<sup>★</sup> = δ(B, a) ∩ C′, and B′ = B<sup>★</sup> otherwise, and
	- • c = { **0** } if B<sup>★</sup> = ∅, and c = ∅ otherwise.

We can see that checking whether w is accepted by the complement of A<sub>P</sub> reduces to checking whether B has been cleared infinitely often. Since every time B becomes empty we emit the colour **0** , we have that w is not accepted by A within P if and only if **0** occurs infinitely often. Note that the transition function Succ<sup>MH</sup> is deterministic, i.e., there is exactly one successor.
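Under the reconstruction above, the MH successor computation can be sketched as follows; `delta` mapping (state, symbol) pairs to successor sets is our own encoding, and all names are illustrative rather than an actual implementation.

```python
def succ_mh(delta, P, H, C, B, a):
    """One MH step: from macrostate (C, B) over symbol a, with H the set of
    all states of A reached so far. Returns ((C', B'), colours)."""
    # C is part of the macrostate but is not needed here: C' = δ(H, a) ∩ P.
    def step(X):
        return set().union(*(delta.get((q, a), set()) for q in X)) if X else set()
    C2 = step(H) & P          # all runs of A currently inside P
    B_star = step(B) & C2     # breakpoint runs that are still inside P
    if not B_star:
        # breakpoint cleared: emit colour 0 and resample B from C'
        return (C2, C2), {0}
    return (C2, B_star), set()
```

Note that the result is a single pair, mirroring the determinism of Succ<sup>MH</sup>.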

**Lemma 1.** *The partial algorithm* MH *is correct.*

#### **4.2 Complementation of Deterministic Accepting Components**

In this section, we give a partial algorithm CSB with the condition φ<sub>CSB</sub> specifying that a partition block consists of *DACs*. Let P be a partition block of A such that A |=<sub>P</sub> φ<sub>CSB</sub>. Our approach is based on the NCSB family of algorithms [6,11,5,28] for complementing SDBAs, in particular the NCSB-MaxRank construction [28]. The algorithm utilizes the fact that runs in DACs are deterministic, i.e., they do not branch into new runs. Therefore, one can check that a run is non-accepting if there is a time point from which the run does not see accepting transitions any more. We call a run that does not see accepting transitions any more *safe*. Then, an ω-word w is not accepted in P if all runs over w in P either (i) leave P or (ii) eventually become safe.

For checking point (i), we can use a similar technique as in algorithm MH, i.e., a pair (C, B). Moreover, to be able to check point (ii), we also use a set S that contains the runs that are supposed to be *safe*, resulting in macrostates of the form (C, S, B)<sup>9</sup>. To make sure that all tracked runs are deterministic, we will use δ<sub>SCC</sub> instead of δ when computing the successors of S and B, since there may be nondeterministic jumps between different DACs in P; we will not miss any run in P since, if a run moves between DACs of P, it

<sup>9</sup> In contrast to MH, here we use C ∪ S rather than C to keep track of all runs in P.

Fig. 1: Left: BA A<sub>ex</sub> (dots represent accepting transitions). Right: the outcome of ModCompl(CSB<sub>P<sub>0</sub></sub>, MH<sub>P<sub>1</sub></sub>, A<sub>ex</sub>) with Acc: Inf( **0** ) ∧ Inf( **1** ). States are given as (H, (C<sub>0</sub>, S<sub>0</sub>, B<sub>0</sub>), (C<sub>1</sub>, B<sub>1</sub>)); to avoid too many braces, sets are given as sums.

can be seen as the run leaving P and a new run entering P. Since a run eventually stays in one SCC, this guarantees that the run will not be missed.

We formalize CSB in the top-level framework as follows:

	- if (, ) ≠ ∅, then = ∅ (Runs in must be *safe*),
		- otherwise contains ( ( ′ , ′ , ′ ), ) where

$$\begin{array}{rcl} \ast \ S' = \delta\_{\text{SCC}}(S, a) \cap P, \ C' = (\delta(H, a) \cap P) \nmid S',\\ \ast \ B' = \begin{cases} C' & \text{if } B^{\mathsf{A}} = \emptyset \text{ for } B^{\mathsf{A}} = \delta\_{\text{SCC}}(B, a),\\ B^{\mathsf{A}} & \text{otherwise, and} \end{cases} \quad \ast \ c = \begin{cases} \{\mathsf{O}\} & \text{if } B^{\mathsf{A}} = \mathsf{0},\\ \emptyset & \text{otherwise.} \end{cases} \end{array}$$

Moreover, in the case δ_F(C, a) = ∅, Succ<sup>CSB</sup> also contains ((C′′, S′′, B′′), { **0** }) where S′′ = S′ ∪ C′ and C′′ = C′ \ S′′.

Intuitively, when no run in C takes an accepting transition, we make the following guess: (i) either the runs in C all become safe (we move them to S) or (ii) there might be some unsafe runs (we keep them in C). Since the runs in P are deterministic, the number of tracked runs in S will not increase. Moreover, if all runs in C are eventually safe, we are guaranteed to move all of them to S at the right time point, e.g., the maximal time point where all runs are safe, since the number of runs is finite.

As mentioned above, α is not accepted within P if all runs over α either (i) leave P or (ii) become safe. In the context of the presented algorithm, this corresponds to (i) B becoming empty infinitely often and (ii) the runs in S never seeing an accepting transition. Then we only need to check if there exists an infinite sequence of macrostates (C0, S0, B0) . . . that emits **0** infinitely often.
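The notion of a *safe* run can be illustrated with a small Python sketch (our own toy example, not the paper's exact construction): we track a set of supposedly-safe states of a deterministic block and reject the guess as soon as a tracked run crosses an accepting transition. The automaton (`delta`, `acc`) is hypothetical.

```python
# Toy DAC: deterministic transitions inside the block; acc holds accepting transitions.
delta = {("q0", "a"): "q1", ("q1", "a"): "q0", ("q1", "b"): "q1"}
acc = {("q1", "b", "q1")}  # the only accepting transition

def safe_succ(S, letter):
    """Successor of a set S of supposedly-safe states; None signals that a
    "safe" run crossed an accepting transition, i.e., the guess was wrong."""
    S2 = set()
    for q in S:
        q2 = delta.get((q, letter))
        if q2 is None:               # the run leaves the block
            continue
        if (q, letter, q2) in acc:   # a safe run saw an accepting transition
            return None
        S2.add(q2)
    return S2

assert safe_succ({"q0", "q1"}, "a") == {"q0", "q1"}
assert safe_succ({"q1"}, "b") is None
```

A word is then rejected along this branch exactly when the guessed safe set can be propagated forever without `safe_succ` ever failing.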

## **Lemma 2.** *The partial algorithm* CSB *is correct.*

It is worth noting that when the given partition block contains all DACs of A, we can still use the construction above, while the construction in [28] only works on SDBAs.

*Example 1.* In Fig. 1, we give an example of the run of our algorithm on the BA Aex. The BA contains three SCCs, one of them non-accepting (therefore, it does not need to occur in any partition block). The partition block P0 contains a single DAC, so we can use algorithm CSB, and the partition block P1 contains a single accepting IWC, so we can use MH. The resulting ModCompl(CSB<sub>P0</sub>, MH<sub>P1</sub>, Aex) uses two colours, **0** from CSB and **1** from MH. The acceptance condition is Inf( **0** ) ∧ Inf( **1** ). ⊓⊔

#### **4.3 Upper-bound for Elevator Automata Complementation**

We now give an upper bound on the size of the complement generated by our algorithm for elevator automata, which significantly improves the best previously known upper bound of O(16<sup>n</sup>) [29] to O(4<sup>n</sup>), the same as for SDBAs, which are a strict subclass of elevator automata [6] (we note that this upper bound cannot be obtained by a determinization-based algorithm, since determinization of SDBAs is in Ω(n!) [17,40]).

**Theorem 2.** *Let* A *be an elevator automaton with n states. Then there exists a BA with* O(4<sup>n</sup>) *states accepting the complement of* L(A)*.*

*Proof (Sketch).* Let W be all states in accepting IWCs, D be all states in DACs, and N be the remaining states, i.e., Q = W ⊎ D ⊎ N. We make two partition blocks P0 = W and P1 = D and use MH and CSB respectively as the partial algorithms, with macrostates of the form (H, (C0, B0), (C1, S1, B1)). For each state q ∈ N, there are two options: either q ∉ H or q ∈ H. For each state q ∈ W, there are three options: (i) q ∉ C0, (ii) q ∈ C0 \ B0, or (iii) q ∈ C0 ∩ B0. Finally, for each q ∈ D, there are four options: (i) q ∉ C1 ∪ S1, (ii) q ∈ S1, (iii) q ∈ C1 \ B1, or (iv) q ∈ C1 ∩ B1. Therefore, the total number of macrostates is 2 · 2<sup>|N|</sup> · 3<sup>|W|</sup> · 4<sup>|D|</sup> ∈ O(4<sup>n</sup>), where the initial factor 2 is due to degeneralization from two colours to one (the two colours can actually be avoided by using our shared breakpoint optimization from Sec. 5.4). ⊓⊔
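The counting argument in the proof sketch can be sanity-checked numerically. The sketch below (with hypothetical partition sizes) verifies that a product of the form 2 · 2^a · 3^b · 4^c never exceeds 2 · 4^n for n = a + b + c:

```python
# Sanity check of the macrostate count from the proof sketch.
# a, b, c are hypothetical sizes of the three classes of states.
def macrostates(a, b, c):
    return 2 * 2**a * 3**b * 4**c

for a, b, c in [(3, 4, 5), (0, 10, 0), (7, 1, 2)]:
    n = a + b + c
    assert macrostates(a, b, c) <= 2 * 4**n  # hence the count is in O(4^n)
```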

## **5 Optimizations of the Modular Construction**

In this section, we propose optimizations of the basic modular algorithm. In Sec. 5.1, we give a partial algorithm to complement initial partition blocks with DACs. Further, in Sec. 5.2, we propose the postponed construction, which allows using automata reduction on intermediate results. In Sec. 5.3, we propose the round-robin algorithm, alleviating the problem of the explosion of the size of the Cartesian product of partial successors. In Sec. 5.4, we provide an optimization for partial algorithms that are based on the breakpoint construction, and, finally, in Sec. 5.5, we show how to employ simulation to decrease the size of macrostates in the synchronous construction.

#### **5.1 Complementation of Initial Deterministic Partition Blocks**

Our first optimization is an algorithm CoB for a subclass of partition blocks containing DACs. In particular, the condition for CoB specifies that the partition block P is deterministic and can be reached only deterministically in A (i.e., A after removing redundant states is deterministic). Then, we say that P is an *initial deterministic* partition block. The algorithm is based on complementation of deterministic BAs into co-Büchi automata.

The algorithm CoB is formalized below:

$$T^{\mathsf{CoB}} = P \cup \{\emptyset\}, \quad \mathit{Init}^{\mathsf{CoB}} = I \cap P, \quad \mathit{Colours}^{\mathsf{CoB}} = \{\mathbf{0}\}, \quad \mathit{Acc}^{\mathsf{CoB}} = \mathsf{Fin}(\mathbf{0}),$$

$$\begin{array}{c} \mathit{Succ}^{\mathsf{CoB}}(H, q, a) = \{ (q', \alpha) \} \text{ where} \\[4pt]
q' = \begin{cases} r & \text{if } \delta(H, a) \cap P = \{ r \} \text{ and} \\ \emptyset & \text{otherwise}, \end{cases} \qquad
\alpha = \begin{cases} \{\mathbf{0}\} & \text{if } q \xrightarrow{a} q' \in F \text{ and} \\ \emptyset & \text{otherwise}. \end{cases}
\end{array}$$

Intuitively, P is reached only deterministically, which means that over a word α, at most one run can be inside P at any time (so Init<sup>CoB</sup> contains at most one state); this corresponds to the condition δ(H, a) ∩ P = {r} in the construction. To check whether α is not accepted in P, we only need to check whether the run inside P over α visits accepting transitions only finitely often. We give an example of complementation of a BA containing an initial deterministic partition block in [27].
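The idea behind CoB can be sketched in a few lines of Python (our simplification over a hypothetical deterministic block `P`): at most one state of the block is tracked, and a colour is reported whenever the run crosses an accepting transition. Under the Fin acceptance, a word belongs to the complement iff the colour is reported only finitely often.

```python
# Hypothetical initial deterministic block and its transitions.
delta = {("p", "a"): "p", ("p", "b"): "q", ("q", "b"): "q"}
acc = {("q", "b", "q")}  # accepting transitions
P = {"p", "q"}

def cob_succ(state, letter):
    """Return (state', colour): state' is the unique successor inside P
    (or None if the run leaves P); colour is True iff the step crossed
    an accepting transition (i.e., colour 0 would be emitted)."""
    s2 = delta.get((state, letter))
    if s2 not in P:
        return None, False
    return s2, (state, letter, s2) in acc

assert cob_succ("p", "a") == ("p", False)
assert cob_succ("q", "b") == ("q", True)
```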

**Lemma 3.** *The partial algorithm* CoB *is correct.*

### **5.2 Postponed Construction**

The modular synchronous construction from Sec. 3.1 utilizes the assumption that, in the simultaneous construction of successors for each partition block over a, if one partial macrostate does not have a successor over a, then there will be no successor of the whole macrostate in C either. This is useful, e.g., for inclusion testing, where it is not necessary to generate the whole complement. On the other hand, if we need to generate the whole automaton, a drawback of the proposed modular construction is that each partial complementation algorithm itself may generate a lot of useless states. In this section, we propose the *postponed construction*, which complements the partition blocks (with their surroundings) independently and later combines the intermediate results to obtain the complement automaton for A. The main advantage of the postponed construction is that one can apply automata reduction (e.g., based on removing useless states or using simulation [13,18,1,9]) to decrease the size of the intermediate automata.

In the postponed construction, we use a product-based BA intersection operation (i.e., for two TELAs B1 and B2, a product automaton B1 ∩ B2 satisfying L(B1 ∩ B2) = L(B1) ∩ L(B2)<sup>10</sup>). Further, we employ a function Red performing some language-preserving reduction of an input TELA. Then, the postponed construction for an elevator automaton A with a partitioning P1, . . . , Pn and a sequence Alg<sup>1</sup>, . . . , Alg<sup>n</sup>, where Alg<sup>i</sup> is a partial complementation algorithm for Pi, is defined as follows:

$$\mathit{PostpCompl}(\mathsf{Alg}^1_{P_1}, \dots, \mathsf{Alg}^n_{P_n}, \mathcal{A}) = \bigcap_{i=1}^n \mathit{Red}\left(\mathit{ModCompl}(\mathsf{Alg}^i_{P_i}, \mathcal{A}_{P_i})\right). \tag{2}$$

The correctness of the construction is then summarized by the following theorem.

**Theorem 3.** *Let* A *be a BA,* P1, . . . , Pn *be a partitioning of* A*, and* Alg<sup>1</sup>, . . . , Alg<sup>n</sup> *be a sequence of partial complementation algorithms such that* Alg<sup>i</sup> *is* correct *for* Pi*. Then,* L(*PostpCompl*(Alg<sup>1</sup><sub>P1</sub>, . . . , Alg<sup>n</sup><sub>Pn</sub>, A)) = Σ<sup>ω</sup> \ L(A)*.*
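A much-simplified product sketch (our own illustration, not Spot's implementation) shows the synchronization underlying the intersection: two colour-labelled transition systems are stepped in lockstep, and — assuming their colour sets are disjoint — the product's acceptance condition is simply the conjunction of the two original ones.

```python
from itertools import product as cartesian

def tela_product(trans1, init1, trans2, init2):
    """trans: dict (state, letter) -> list of (succ, frozenset_of_colours).
    Colours of the two inputs are assumed disjoint, so unioning them on
    each product transition realizes the conjoined acceptance condition."""
    trans, init = {}, [(i1, i2) for i1 in init1 for i2 in init2]
    todo, seen = list(init), set()
    while todo:
        s = todo.pop()
        if s in seen:
            continue
        seen.add(s)
        s1, s2 = s
        letters = ({a for (q, a) in trans1 if q == s1}
                   & {a for (q, a) in trans2 if q == s2})
        for a in letters:
            for (t1, c1), (t2, c2) in cartesian(trans1[(s1, a)], trans2[(s2, a)]):
                trans.setdefault((s, a), []).append(((t1, t2), c1 | c2))
                todo.append((t1, t2))
    return trans, init

t1 = {("p", "a"): [("p", frozenset({0}))]}
t2 = {("x", "a"): [("x", frozenset({1}))]}
trans, init = tela_product(t1, ["p"], t2, ["x"])
assert init == [("p", "x")]
assert trans[(("p", "x"), "a")] == [(("p", "x"), frozenset({0, 1}))]
```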

#### **5.3 Round-Robin Algorithm**

The proposed basic synchronous approach from Sec. 3.1 may suffer from a combinatorial explosion because the successors of a macrostate are given by the Cartesian product of all successors of the partial macrostates. To alleviate this explosion, we propose

<sup>10</sup> Alternatively, one might also avoid the product and generate a linear-sized *alternating* TELA, but working with those is usually much harder, so this is not done in practice.

a *round-robin* top-level algorithm. Intuitively, the round-robin algorithm actively tracks runs in only one partial complementation algorithm at a time (while other algorithms stay passive). The algorithm periodically changes the active algorithm to avoid starvation (the decision to leave the active state is, however, fully directed by the partial complementation algorithm). This can alleviate an explosion in the number of successors for algorithms that generate more than one successor (e.g., for rank-based algorithms where one needs to make a nondeterministic choice of decreasing ranks of states in order to be able to accept [34,21,48,10,24,29]; such a choice needs to be made only in the active phase while in the passive phase, the construction just needs to make sure that the run is consistent with the given ranking, which can be done deterministically).

The round-robin algorithm works on the level of *partial complementation round-robin algorithms*. Each instance of the partial algorithm provides *passive types* to represent partial macrostates that are passive and *active types* to represent currently active partial macrostates. In contrast to the basic partial complementation algorithms from Sec. 3.1, which provide only a single successor function, the round-robin partial algorithms provide several variants of them. In particular, SuccPass returns (passive) successors of a passive partial macrostate, Lift gives all possible active counterparts of a passive macrostate, and SuccAct returns successors of an active partial macrostate. If SuccAct returns a partial macrostate of the passive type, the round-robin algorithm promotes the next partial algorithm to be the active one. For instance, in the round-robin version of CSB, the passive type does not contain the breakpoint and only checks that safe runs stay safe, so it is deterministic. Due to space limitations, we give a formal definition and more details about the round-robin algorithm in [27].
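The active/passive switching can be sketched as follows. This is a simplification under our own hypothetical interface: `succ_active` returns pairs of a successor and a flag saying whether the algorithm stays active, `succ_passive` is deterministic and may fail, and the `Lift` function is omitted.

```python
def rr_successors(macrostate, active_idx, algs, letter):
    """macrostate: tuple of partial macrostates, one per algorithm in algs.
    Returns a list of (macrostate', active_idx') successors: only the active
    algorithm may branch; passive algorithms step deterministically."""
    results = []
    for succ, stays_active in algs[active_idx].succ_active(macrostate[active_idx], letter):
        parts, ok = list(macrostate), True
        parts[active_idx] = succ
        for i, alg in enumerate(algs):
            if i == active_idx:
                continue
            p = alg.succ_passive(parts[i], letter)
            if p is None:            # a passive part has no successor
                ok = False
                break
            parts[i] = p
        if ok:
            nxt = active_idx if stays_active else (active_idx + 1) % len(algs)
            results.append((tuple(parts), nxt))
    return results

class Count:                          # trivial stand-in partial algorithm
    def succ_active(self, m, letter):
        return [(m + 1, False)]       # one successor, then hand over the token
    def succ_passive(self, m, letter):
        return m                      # passive stepping is deterministic

assert rr_successors((0, 0), 0, [Count(), Count()], "a") == [((1, 0), 1)]
```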

#### **5.4 Shared Breakpoint**

The partial complementation algorithms CSB and MH (and later RNK defined in Sec. 6) use a breakpoint to check whether the runs under inspection are accepting or not. As an optimization, we consider merging the breakpoints of several algorithms and keeping only a single breakpoint for all supported algorithms. The top-level algorithm then needs to manage only one breakpoint and emit a colour only if this sole breakpoint becomes empty. This may lead to a smaller number of generated macrostates since we synchronize the breakpoint sampling among several algorithms. The second benefit is that this allows us to generate fewer colours (in the case of elevator automata complemented using algorithms CSB and MH, we get only one colour).
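The shared-breakpoint idea can be sketched as follows (our simplification, with hypothetical inputs): instead of one breakpoint per partial algorithm, a single set is stepped, and only when it empties is it refilled from all blocks at once, emitting the single colour.

```python
def shared_breakpoint_step(B, new_samples, step):
    """B: current shared breakpoint; new_samples: per-block sets to resample
    from when B empties; step: maps a state to its set of successors.
    Returns (B', emitted_colours)."""
    B2 = {q2 for q in B for q2 in step(q)}
    if B2:
        return B2, set()                   # breakpoint survives, no colour
    resampled = set().union(*new_samples)  # refill from all blocks at once
    return resampled, {0}                  # the single colour 0 is emitted

step = {"p": {"q"}, "q": set()}.get        # hypothetical successor function
assert shared_breakpoint_step({"p"}, [{"p"}, {"q"}], step) == ({"q"}, set())
assert shared_breakpoint_step({"q"}, [{"p"}, {"q"}], step) == ({"p", "q"}, {0})
```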

#### **5.5 Simulation Pruning**

Our construction can be further optimized by a simulation (or other compatible) relation for pruning macrostates.<sup>11</sup> A simulation is, broadly speaking, a relation ≼ ⊆ Q × Q implying language inclusion of states, i.e., ∀p, q ∈ Q: p ≼ q =⇒ L(A[p]) ⊆ L(A[q]). Intuitively, our optimization allows removing a state p from a macrostate if there is also a state q in the macrostate such that (i) p ≼ q, (ii) q is not reachable from p, and (iii) p is smaller than q in an arbitrary total order over Q (this serves as a tie-breaker for

<sup>11</sup> This optimization can be seen as a generalization of the simulation-based pruning techniques that appeared, e.g., in [41,28] in the context of concrete determinization/complementation procedures. Here, we generalize the technique to all procedures that are based on run tracking.

simulation-equivalent mutually unreachable states). The reason why p can be removed is that its behaviour can be completely mimicked by q. In our construction, we can then, roughly speaking, replace each call to the functions δ(R, a) and δ_SCC(R, a), for a set of states R, by pr(δ(R, a)) and pr(δ_SCC(R, a)), respectively, in each partial complementation algorithm, as well as in the top-level algorithm, where pr(R) is obtained from R by pruning all eligible states. The details are provided in [27].
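A minimal pruning sketch (our own illustration; in particular, the direction of the reachability test in condition (ii) is our reading, and `sim`, `reach`, and `order` are hypothetical inputs):

```python
def prune(S, sim, reach, order):
    """Prune a macrostate S. sim: set of pairs (p, q) with p simulated by q;
    reach(a, b): b is reachable from a; order: total-order key (tie-breaker)."""
    removed = set()
    for p in S:
        for q in S:
            if p == q or q in removed:
                continue
            # q mimics p: p simulated by q, q not reachable from p (our
            # reading of condition (ii)), and p smaller in the total order.
            if (p, q) in sim and not reach(p, q) and order(p) < order(q):
                removed.add(p)
                break
    return S - removed

sim = {("p", "q")}
order = {"p": 0, "q": 1}.get
assert prune({"p", "q"}, sim, lambda a, b: False, order) == {"q"}
```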

## **6 Modular Complementation of Non-Elevator Automata**

A non-elevator automaton A contains at least one NAC, besides possibly other IWCs or DACs. To complement A in a modular way, we apply the techniques seen in Sec. 4 to its DACs and IWCs, while for its NACs we resort to a general complementation algorithm Alg. In theory, rank- [34], slice- [32], Ramsey- [50], subset-tuple- [2], and determinization- [46] based complementation algorithms adapted to work on a single partition block instead of the whole automaton are all valid instantiations of Alg. Below, we give a high-level description of two such algorithms: rank- and determinization-based.

*Rank-based partial complementation algorithm.* Working on each NAC independently benefits the complementation algorithm even if the input BA contains only NACs. For instance, in rank-based algorithms [34,21,48,33,10,24,29], whether all runs of A over a given ω-word are non-accepting is determined by *ranks* of states, given by so-called *ranking functions*. A ranking function is a (partial) function from Q to ω. The main idea of rank-based algorithms is the following: (i) every run is initially nondeterministically assigned a rank, (ii) ranks can only decrease along a run, (iii) ranks need to be even every time a run visits an accepting transition, and (iv) the complement automaton accepts if all runs eventually get trapped in odd ranks.<sup>12</sup> In the standard rank-based procedure, the initial assignment of ranks to states in (i) is a function Q ⇀ {0, . . . , 2n − 1} for n = |Q|. Using our framework, we can, however, significantly restrict the considered ranks in a partition block P to only P ⇀ {0, . . . , 2p − 1} for p = |P| (here, it makes sense to use partition blocks consisting of single SCCs). One can further reduce the considered ranks using the techniques introduced in, e.g., [24,29].
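Constraints (ii) and (iii) on ranks along a run prefix can be sketched as a simple check (our own illustration; the run, ranks, and accepting edges below are hypothetical):

```python
def ranks_consistent(run, ranks, acc):
    """run: list of states; ranks: dict state -> rank; acc: set of
    accepting (src, dst) transition pairs occurring along this run."""
    for s, t in zip(run, run[1:]):
        if ranks[t] > ranks[s]:
            return False          # (ii) ranks may only decrease along a run
        if (s, t) in acc and ranks[t] % 2 != 0:
            return False          # (iii) rank must be even on accepting edges
    return True

assert ranks_consistent(["q0", "q1", "q2"],
                        {"q0": 3, "q1": 2, "q2": 1}, {("q0", "q1")})
assert not ranks_consistent(["q0", "q1"],
                            {"q0": 1, "q1": 1}, {("q0", "q1")})
```

A run satisfying these constraints that eventually gets trapped in an odd rank is non-accepting, which is what the complement automaton checks, per (iv).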

In order to adapt the rank-based construction as a partial complementation algorithm RNK in our framework, we need to extend the ranking functions by a fresh "box" state □ representing states outside the partition block. The ranking function then uses □ to represent ranks of runs newly coming into the partition block. The box-extension also requires changing the transition function in a way that □ always represents the states reachable from the outside. We provide the details of the construction, which includes the MaxRank optimization from [24], in [27].

*Determinization-based partial complementation algorithm.* In [52,29] we can see that determinization-based complementation is also a good instantiation of Alg in practice, so we also consider the standard Safra-Piterman determinization [46,43,45] as a choice of Alg for complementing NACs. Determinization-based algorithms use a layered subset construction to organize all runs over an ω-word α. The idea is to identify a subset of reachable states that occurs infinitely often along reading α such that, between every two of its occurrences, (i) every state in the second occurrence can be reached

<sup>12</sup> Since we focus on intuition here, we use runs rather than the directed acyclic graphs of runs.

Table 1: Statistics for our experiments. The column **unsolved** classifies unsolved instances by the form *timeouts : out of memory : other failures*. For the cases of VBS, we provide just the number of unsolved cases. The columns **states** and **runtime** provide the *mean : median* of the number of states and of the runtime, respectively.


by a state in the first occurrence and (ii) every state in the second occurrence is reached from a state in the first occurrence while seeing an accepting transition. According to König's lemma, there must then be an accepting run of A over α.

The construction initially maintains only one set: the set of reachable states. Since the subset described above is not known in advance, every time there are runs visiting accepting transitions, we create a new subset for those runs and remember which subset each run is coming from. This way, we actually organize the current states of all runs into a tree structure and perform the subset construction in parallel for the sets in each tree node. If we find a tree node whose labelled subset is equal to the union of the states in its children, we know that this subset satisfies the condition above, so we remove all its child nodes and emit a good event. If such a good event happens infinitely often, it means that the subset also occurs infinitely often. So, in complementation, we only need to make sure that those good events happen only finitely many times. Working on each NAC separately also benefits the determinization-based approach, since the number of possible trees is smaller with a smaller number of reachable states. Following the idea of [37], to adapt the construction as a partial complementation algorithm, we put all the runs newly coming from other partition blocks in a newly created node without a parent node. In this way, we actually maintain a forest of trees for the partial complementation construction. We denote the determinization-based construction by DET; cf. [37] for details.

## **7 Experimental Evaluation**

To evaluate the proposed approach, we implemented it in a prototype tool Kofola [25] (written in C++) built on top of Spot [16] and compared it against COLA [37], Ranker [28] (v. 2), Seminator [5] (v. 2.0), and Spot [15,16] (v. 2.10.6), which represent the state of the art in BA complementation [29,28,37]. Due to space restrictions, we give results for only two instantiations of our framework, both implemented in Kofola: one employing the *synchronous* and one employing the *postponed* strategy. Both instantiations use MH for IWCs, CSB for DACs, and DET for NACs. The partitioning selection algorithm merges all IWCs into one partition block, merges all DACs into one partition block, and keeps all NACs separate. Simulation-based pruning from Sec. 5.5 is turned on, and round-robin from Sec. 5.3 is turned off (since the selected algorithms are quite deterministic). We also consider the Virtual Best Solver (VBS), i.e., a virtual tool that would choose the best solver for each single benchmark, among all tools (VBS+) and among all tools except both versions of Kofola (VBS−). We ran our experiments on an Ubuntu 20.04.4 LTS system running on a desktop machine with 16 GiB RAM and an

Fig. 2: Scatter plots comparing the numbers of states generated by the tools.

Intel 3.6 GHz i7-4790 CPU. To constrain and collect statistics about the executions of the tools, we used BenchExec [3] and imposed a memory limit of 12 GiB and a timeout of 10 minutes; we used Spot to cross-validate the equivalence of the automata generated by the different tools. An artifact reproducing our experiments is available as [26].

As our data set, we used 39,837 BAs from the automata-benchmarks repository [36] (used before by, e.g., [29,28,37]), which contains BAs from the following sources: (i) randomly generated BAs used in [52] (21,876 BAs), (ii) BAs obtained from LTL formulae from the literature and randomly generated LTL formulae [5] (3,442 BAs), (iii) BAs obtained from Ultimate Automizer [11] (915 BAs), (iv) BAs obtained from the solver for first-order logic over Sturmian words Pecan [31] (13,216 BAs), (v) BAs obtained from an S1S solver [23] (370 BAs), and (vi) BAs from LTL to SDBA translation [49] (18 BAs). From these BAs, 23,850 are deterministic, 6,147 are SDBAs (but not deterministic), 4,105 are elevator (but not SDBAs), and 5,735 are the rest.

In Table 1 we present an overview of the outcomes. Despite being a prototype, Kofola can already complement a large portion of the input automata, with very few cases that can be complemented successfully only by Spot or COLA. Regarding the mean number of states, Kofola has the **least mean value** of all tools (except Ranker, which, however, had 1,000 unsolved cases). Moreover, Kofola **significantly decreased the mean number of states** when included into the VBS: from 96 to 78! We consider this to be a strong validation of the usefulness of our approach. Regarding the runtime, both versions of Kofola are rather similar; Kofola is just slightly slower than Spot and COLA but much faster than both Ranker and Seminator (cf. [27]).

In Fig. 2 we present a comparison of the number of states generated by Kofola and other tools; we omit VBS<sup>+</sup> since the corresponding plot can be derived from the one for VBS<sup>−</sup> (since Ranker and Seminator only output BAs, we compare the sizes of outputs transformed into BAs for all tools to be fair). In the plots, the number of benchmarks represented by each mark is given by its colour; a mark above the diagonal means that Kofola generated a BA smaller than the other tool while a mark on the top border means that the other tool failed while Kofola succeeded, and symmetrically for the bottom part and the right-hand border. Dashed lines represent the maximum number of states generated by one of the tools in the plot, axes are logarithmic.

From the results, Kofola clearly dominates the state-of-the-art tools that are not based on SCC decomposition (Ranker, Spot, Seminator). The outputs are quite comparable to COLA, which also uses SCC decomposition and can be seen as an instantiation of our framework. This supports our intuition that working on the single SCCs helps in reducing the size of the final automaton, confirming the validity of our modular mix-and-match Büchi complementation approach. Lastly, in the figure on the right we compare our algorithm for elevator automata with the one in Ranker (the only other tool with a dedicated algorithm for this subclass). Our new algorithm clearly dominates the one in Ranker.

## **8 Related Work**

To the best of our knowledge, we provide the *first general framework* where one can plug in different BA complementation algorithms while taking advantage of the specific structure of SCCs. We discuss the differences between our work and the literature below.

The breakpoint construction [42] was designed to complement BAs with only IWCs, while our construction treats it as a partial complementation procedure for IWCs; it differs in the need to handle incoming states from other partition blocks. The NCSB family of algorithms [6,11,5,28] for SDBAs does not work when there are nondeterministic jumps between DACs; these algorithms can, however, be adapted as partial procedures for complementing DACs in our framework, cf. Sec. 4.2. In [29], a deelevation-based procedure is applied to elevator automata to obtain BAs with a fixed maximum rank of 3, for which a rank-based construction produces a result of size in O(16<sup>n</sup>). In our work, we exploit the structure of the SCCs much more to obtain an exponentially better upper bound of O(4<sup>n</sup>) (the same as for SDBAs). The upper bound O(4<sup>n</sup>) for complementing unambiguous BAs was established in [39], which is orthogonal to our work, but it seems possible to incorporate it into our framework in the future.

There is a huge body of work on complementation of general BAs [8,50,7,34,21,22,10,24,29,48,2,46,43,45,5,52,32,53,19,20]; all of them work on the whole graph structure of the input BAs. Our framework is general enough to allow including all of them as partial complementation procedures for NACs. On the contrary, our framework does not directly allow (at least in the synchronous strategy) the use of algorithms that *do not* work on the structure of the input BA, such as the learning-based complementation algorithm from [38]. The recent determinization algorithm from [37], which serves as our inspiration, also handles SCCs separately (it can actually be seen as an instantiation of our framework). Our current algorithm is, however, more flexible: it allows mixing and matching various constructions, keeping SCCs separate or merging them into partition blocks, and it obtains the complexity O(4<sup>n</sup>), while [37] only allowed O(n!) (which is tight, since SDBA determinization is in Ω(n!) [17,40]).

Regarding the tool Spot [15,16], it should not be perceived as a single complementation algorithm. Instead, Spot should be seen as a highly engineered platform utilizing breakpoint construction for inherently weak BAs, NCSB [6,11] for SDBAs, and determinization-based complementation [46,43,45] for general BAs, while using many other heuristics along the way. Seminator uses semi-determinization [14,4,5] to make sure the input is an SDBA and then uses NCSB [6,11] to compute the complement.

## **9 Conclusion and Future Work**

We have proposed a general framework for BA complementation where one can plug in different partial complementation procedures for SCCs, taking advantage of their specific structure. Our framework not only obtains an exponentially better upper bound for elevator automata, but also complements existing approaches well. As shown by the experimental results (especially for the VBS), our framework significantly improves the current portfolio of complementation algorithms.

We believe that our framework is an ideal testbed for experimenting with different BA complementation algorithms, e.g., for the following two reasons: (i) One can develop an efficient complementation algorithm that only works for a quite restricted subclass of BAs (such as the algorithm for initial deterministic SCCs that we showed in Sec. 5.1) and the framework can leverage it for complementation of all BAs that contain such a substructure. (ii) When one tries to improve a general complementation algorithm, they can focus on complementation of the structurally hard SCCs (mainly the nondeterministic accepting SCCs) and do not need to look for heuristics that would improve the algorithm if there were some easier substructure present in the input BA (as was done, e.g., in [29]). From how the framework is defined, it immediately offers opportunities for being used for on-the-fly BA *language inclusion* testing, leveraging the partial complementation procedures present. Finally, we believe that the framework also enables new directions for future research by developing smart ways, probably based on machine learning, of selecting which partial complementation procedure should be used for which SCC, based on their features. In the future, we want to incorporate other algorithms for complementation of NACs, and identify properties of SCCs that allow using more efficient algorithms (such as unambiguous NACs [39]). Moreover, it seems that generalizing the Delayed optimization from [24] on the top-level algorithm could also help reduce the state space.

*Acknowledgements.* We thank the reviewers for their useful remarks that helped us improve the quality of the paper and Alexandre Duret-Lutz for sharing a TikZ package for beautiful automata. This work was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (grant no. XDA0320000); the National Natural Science Foundation of China (grants no. 62102407 and 61836005); the CAS Project for Young Scientists in Basic Research (grant no. YSBR-040); the Engineering and Physical Sciences Research Council (grant no. EP/X021513/1); the Czech Ministry of Education, Youth and Sports project LL1908 of the ERC.CZ programme; the Czech Science Foundation project GA23-07565S; and the FIT BUT internal project FIT-S-23-8151.

This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant no. 101008233.

*Data Availability Statement.* An environment with the tools and data used for the experimental evaluation in the current study is available in the following Zenodo repository: https://doi.org/10.5281/zenodo.7505210.

## **References**







55. Yan, Q.: Lower bounds for complementation of omega-automata via the full automata technique. Log. Methods Comput. Sci. **4**(1) (2008). https://doi.org/10.2168/LMCS-4(1:5)2008

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.


## Validating Streaming JSON Documents with Learned VPAs<sup>∗</sup>

Véronique Bruyère<sup>1</sup>, Guillermo A. Pérez<sup>2</sup>, and Gaëtan Staquet<sup>1,2</sup>

<sup>1</sup> University of Mons (UMONS), Mons, Belgium {veronique.bruyere,gaetan.staquet}@umons.ac.be <sup>2</sup> University of Antwerp (UAntwerp) – Flanders Make, Antwerp, Belgium guillermo.perez@uantwerpen.be

Abstract. We present a new streaming algorithm to validate JSON documents against a set of constraints given as a JSON schema. Among the possible values a JSON document can hold, objects are unordered collections of key-value pairs while arrays are ordered collections of values. We prove that there always exists a visibly pushdown automaton (VPA) that accepts the same set of JSON documents as a JSON schema. Leveraging this result, our approach relies on learning a VPA for the provided schema. As the learned VPA assumes a fixed order on the key-value pairs of the objects, we abstract its transitions in a special kind of graph, and propose an efficient streaming algorithm using the VPA and its graph to decide whether a JSON document is valid for the schema. We evaluate the implementation of our algorithm on a number of random JSON documents, and compare it to the classical validation algorithm.

Keywords: Visibly pushdown automata · JSON · streaming validation

## 1 Introduction

JavaScript Object Notation (JSON) has overtaken XML as the de facto standard data-exchange format, in particular for web applications. JSON documents are easier to read for programmers and end users since they only have arrays and objects as structured types. Moreover, in contrast to XML, they do not include named open and end tags for all values, but open and end tags (braces actually) for arrays and objects only. JSON schema [13] is a simple schema language that allows users to impose constraints on the structure of JSON documents.

In this work, we are interested in the validation of streaming JSON documents against JSON schemas. Several previous results have been obtained about the formalization of XML schemas and the use of formal methods to validate XML documents (see, e.g., [5,15,16,18,24,25]). Recently, a standard to formalize JSON schemas has been proposed and (hand-coded) validation tools for such schemas can be found online [13]. Pezoa et al. [19] observe that the standard

<sup>∗</sup>This work was supported by the Belgian FWO "SAILor" project (G030020N). Gaëtan Staquet is a research fellow (Aspirant) of the Belgian F.R.S.-FNRS.

© The Author(s) 2023 S. Sankaranarayanan and N. Sharygina (Eds.): TACAS 2023, LNCS 13993, pp. 271–289, 2023. https://doi.org/10.1007/978-3-031-30823-9_14

of JSON documents is still evolving and that the formal semantics of JSON schemas is also still changing. Furthermore, validation tools seem to make different assumptions about both documents and schemas. The authors of [19] carry out an initial formalization of JSON schemas into formal grammars from which they are able to construct a batch validation tool from a given JSON schema.

In this paper, we rely on the formalization work of [19] and propose a streaming algorithm for validating JSON documents against JSON schemas. To our knowledge, this is the first streaming JSON validation algorithm. Works that study streaming validation of XML documents base such algorithms on the construction of some automaton (see, e.g., [25]). In [7], we first experimented with one-counter automata for this purpose. We submit that visibly pushdown automata (VPAs) are a better fit for this task; this is in line with [15], where the same was proposed for streaming XML documents. In contrast to one-counter automata,<sup>3</sup> we show that VPAs are expressive enough to capture the language of JSON documents satisfying any JSON schema.

More importantly, we explain that active learning à la Angluin [4] is a good alternative to the automatic construction of such a VPA from the formal semantics of a given JSON schema. This is possible in the presence of labeled examples or a computer program that can answer membership and (approximate) equivalence queries about a set of JSON documents. This learning approach has two advantages. First, we derive from the learned VPA a streaming validator for JSON documents. Second, by automatically learning an automaton representation, we circumvent the need to write a schema and subsequently validate that it represents the desired set of JSON documents. Indeed, it is well known that one of the highest bars that users have to clear to make use of formal methods is the effort required to write a formal specification, in this case, a JSON schema.

Contributions. We present a VPA active learning framework to achieve what was mentioned above, though we fix an order on the keys appearing in objects. The latter assumption helps our algorithm learn faster. Secondly, we show how to bootstrap the learning algorithm by leveraging existing validation and document-generation tools to implement approximate equivalence checks. Thirdly, we describe how to validate streaming documents using our fixed-order learned automata; that is, our algorithm accepts other permutations of keys, not just the one encoded into the VPA. Finally, we present an empirical evaluation of our learning and validation algorithms, implemented on top of LearnLib [17].

All contributions, while complementary, are valuable in their own right. First, our learning algorithm for VPAs is a novel gray-box extension of TTT [9] that leverages side information about the language of all JSON documents. Second, our validation algorithm that uses a fixed-order VPA is novel and can be applied regardless of whether the automaton is learned or constructed from a schema. For the validation algorithm, we developed the concept of key graph, which allows us to efficiently realize the validation no matter the key-value order in the document, and might be of independent interest for other JSON-analysis applications using VPAs. Finally, we implemented our own batch validator to approximate equivalence queries as required by our learning algorithm. Both the new validator and the equivalence oracle are efficient, open-source, and easy to modify. We strongly believe the latter can be re-used in similar projects aiming to learn automata representations of sets of JSON documents.

<sup>3</sup>By nesting objects and arrays, we obtain a set of JSON documents encoding {a<sup>n</sup>b<sup>m</sup>c<sup>m</sup>d<sup>n</sup> | n, m ∈ ℕ}, a context-free language that requires two counters.

A long version of this work is on arXiv: https://arxiv.org/abs/2211.08891.

## 2 Visibly Pushdown Languages

First, we recall the definition of VPAs [3] and state some of their properties. We also recall how they can be actively learned following Angluin's approach [4].

Visibly Pushdown Automata An alphabet Σ is a finite set whose elements are called symbols. A word w over Σ is a finite sequence of symbols from Σ, with the empty word denoted by ε. The length of w is denoted |w|; the set of all words, Σ<sup>∗</sup>. Given two words v, w ∈ Σ<sup>∗</sup>, v is a prefix (resp. suffix) of w if there exists u ∈ Σ<sup>∗</sup> such that w = vu (resp. w = uv), and v is a factor of w if there exist u, u′ ∈ Σ<sup>∗</sup> such that w = uvu′. Given L ⊆ Σ<sup>∗</sup>, called a language, we denote by Pref(L) (resp. Suf(L)) the set of prefixes (resp. suffixes) of words of L. Given a set Q, we write I<sup>Q</sup> for the identity relation {(q, q) | q ∈ Q} on Q.

VPAs [3] are particular pushdown automata that we recall in this section. The pushdown alphabet, denoted Σ˜ = (Σc, Σr, Σi), is partitioned into pairwise disjoint alphabets Σc, Σr, Σi such that Σc (resp. Σr, Σi) is the set of call symbols (resp. return symbols, internal symbols). In this paper, we work with the particular alphabet of return symbols Σr = {a¯ | a ∈ Σc}. For any such Σ˜, we denote by Σ the alphabet Σc ∪ Σr ∪ Σi. Given a pushdown alphabet Σ˜, the set WM(Σ˜) of well-matched words over Σ˜ is defined as the smallest set such that ε ∈ WM(Σ˜) and a ∈ WM(Σ˜) for every a ∈ Σi, that is closed under concatenation, and such that awa¯ ∈ WM(Σ˜) for every a ∈ Σc and w ∈ WM(Σ˜).
Also, the call/return balance function β : Σ<sup>∗</sup> → ℤ is defined as β(ε) = 0 and β(ua) = β(u) + x with x being 1, −1, or 0 if a is in Σc, Σr, or Σi respectively. In particular, for all w ∈ WM(Σ˜), we have β(u) ≥ 0 for each prefix u of w and β(u) ≤ 0 for each suffix u of w. Finally, the depth d(w) of a well-matched word w is equal to max{β(u) | u ∈ Pref({w})}, that is, the maximum number of unmatched call symbols among the prefixes of w.
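For intuition, the balance function β and the depth d(w) can be computed in a single left-to-right pass. The following Python sketch does so for a toy pushdown alphabet; the symbol sets are ours, chosen for illustration, not the paper's.

```python
# Sketch of the call/return balance function beta and the depth d(w),
# over a toy pushdown alphabet (symbol sets chosen for illustration).
CALLS, RETURNS = {"<", "["}, {">", "]"}   # internal symbols: everything else

def balance(word):
    """beta(w): +1 per call symbol, -1 per return symbol, 0 per internal."""
    return sum(1 if a in CALLS else -1 if a in RETURNS else 0 for a in word)

def depth(word):
    """d(w) = max beta(u) over all prefixes u of w."""
    b = d = 0
    for a in word:
        b += 1 if a in CALLS else -1 if a in RETURNS else 0
        d = max(d, b)
    return d

def is_well_matched(word):
    """w is well-matched iff beta >= 0 on every prefix and beta(w) == 0."""
    b = 0
    for a in word:
        b += 1 if a in CALLS else -1 if a in RETURNS else 0
        if b < 0:
            return False
    return b == 0
```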

Definition 1. A visibly pushdown automaton (VPA) A over a pushdown alphabet Σ˜ is a tuple (Q, Σ˜, Γ, δ, Q<sup>I</sup>, Q<sup>F</sup>) where Q is a finite non-empty set of states, Q<sup>I</sup> ⊆ Q is a set of initial states, Q<sup>F</sup> ⊆ Q is a set of final states, Γ is a stack alphabet, and δ = δ<sup>c</sup> ∪ δ<sup>r</sup> ∪ δ<sup>i</sup> is a finite set of transitions, where δ<sup>c</sup> ⊆ Q × Σc × Q × Γ is the set of call transitions, δ<sup>r</sup> ⊆ Q × Σr × Γ × Q is the set of return transitions, and δ<sup>i</sup> ⊆ Q × Σi × Q is the set of internal transitions. The size of A is denoted by |Q|, and its number of transitions by |δ|.

Let us describe the transition system T<sup>A</sup> of a VPA A whose vertices are configurations. A configuration is a pair ⟨q, σ⟩ where q ∈ Q is a state and σ ∈ Γ<sup>∗</sup> a stack content. A configuration is initial (resp. final) if q ∈ Q<sup>I</sup> (resp. q ∈ Q<sup>F</sup>) and σ = ε. For a ∈ Σ, we write ⟨q, σ⟩ −a→ ⟨q′, σ′⟩ in T<sup>A</sup> if there is either a call transition (q, a, q′, γ) ∈ δ<sup>c</sup> verifying σ′ = γσ,<sup>4</sup> or a return transition (q, a, γ, q′) ∈ δ<sup>r</sup> verifying σ = γσ′, or an internal transition (q, a, q′) ∈ δ<sup>i</sup> such that σ′ = σ.

The transition relation of T<sup>A</sup> is extended to words in the usual way. We say that A accepts a word w ∈ Σ<sup>∗</sup> if there exists a path in T<sup>A</sup> from an initial configuration to a final configuration that is labeled by w. The language of A, denoted by L(A), is defined as L(A) = {w ∈ Σ<sup>∗</sup> | ∃q ∈ Q<sup>I</sup>, ∃q′ ∈ Q<sup>F</sup>, ⟨q, ε⟩ −w→ ⟨q′, ε⟩}, i.e., the set of all words accepted by A. Any language accepted by some VPA is a visibly pushdown language (VPL). Notice that such a language is composed of well-matched words only.<sup>5</sup> Given a VPA A over Σ˜, the reachability relation Reach<sup>A</sup> of A is Reach<sup>A</sup> = {(q, q′) ∈ Q<sup>2</sup> | ∃w ∈ WM(Σ˜), ⟨q, ε⟩ −w→ ⟨q′, ε⟩}.
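The acceptance condition can be made concrete by simulating the transition system on configurations (state, stack). The following Python sketch does this for a small illustrative toy automaton of our own devising, not one from the paper.

```python
# Minimal sketch of VPA acceptance: the automaton below is an illustrative
# toy (it accepts "<" * n + ">" * n for n >= 1), not one from the paper.
calls, returns, internals = {"<"}, {">"}, set()
d_call = {("q0", "<"): ("q0", "g")}        # (q, a) -> (q', pushed symbol)
d_ret = {("q0", ">", "g"): "q1",           # (q, a, popped symbol) -> q'
         ("q1", ">", "g"): "q1"}
d_int = {}                                  # (q, a) -> q'
q_init, q_final = {"q0"}, {"q1"}

def accepts(word):
    """Simulate T_A on configurations (state, stack); stack top is stack[0]."""
    configs = {(q, ()) for q in q_init}
    for a in word:
        nxt = set()
        for q, stack in configs:
            if a in calls and (q, a) in d_call:
                q2, g = d_call[(q, a)]
                nxt.add((q2, (g,) + stack))
            elif a in returns and stack and (q, a, stack[0]) in d_ret:
                nxt.add((d_ret[(q, a, stack[0])], stack[1:]))
            elif a in internals and (q, a) in d_int:
                nxt.add((d_int[(q, a)], stack))
        configs = nxt
    # accepted iff some final state is reached with an empty stack
    return any(q in q_final and stack == () for q, stack in configs)
```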

Finally, we say that p ∈ Q is a bin state if there exists no path in T<sup>A</sup> of the form ⟨q, ε⟩ −w→ ⟨p, σ⟩ −w′→ ⟨q′, ε⟩ with q ∈ Q<sup>I</sup> and q′ ∈ Q<sup>F</sup>. If a VPA A has bin states, those states and the transitions containing them can be removed without modifying the accepted language.

Minimal Deterministic VPAs Given a VPA A = (Q, Σ˜, Γ, δ, Q<sup>I</sup>, Q<sup>F</sup>), we say that it is deterministic (det-VPA) if |Q<sup>I</sup>| = 1 and A does not have two distinct transitions with the same left-hand side. By left-hand side, we mean (q, a) for a call transition (q, a, q′, γ) ∈ δ<sup>c</sup> or an internal transition (q, a, q′) ∈ δ<sup>i</sup>, and (q, a, γ) for a return transition (q, a, γ, q′) ∈ δr.

Theorem 1 ([3,32]). For any VPA A over Σ˜, one can construct a det-VPA B over Σ˜ such that L(A) = L(B). Moreover, the size of B is in O(2<sup>|Q|²</sup>) and the size of its stack alphabet is in O(|Σc| · 2<sup>|Q|²</sup>).

Proof. Let us briefly recall this construction. Let A = (Q, Σ˜, Γ, δ, Q<sup>I</sup>, Q<sup>F</sup>). The states of B are subsets R of the reachability relation Reach<sup>A</sup> of A and the stack symbols of B are of the form (R, a) with R ⊆ Reach<sup>A</sup> and a ∈ Σc. Let w = u1a1u2a2 . . . unanun+1 be such that n ≥ 0 and ui ∈ WM(Σ˜), ai ∈ Σc for all i. That is, we decompose w in terms of its unmatched call symbols. Let Ri be equal to {(p, q) | ⟨p, ε⟩ −ui→ ⟨q, ε⟩} for all i. Then after reading w, the det-VPA B has its current state equal to Rn+1 and its stack containing (Rn, an) . . . (R2, a2)(R1, a1). Assume we are reading the symbol a after w; then B performs the following transition from Rn+1: (1) if a ∈ Σc, then push (Rn+1, a) on the stack and go to the state R = I<sup>Q</sup> (a new unmatched call symbol is read); (2) if a ∈ Σi, then go to the state R = {(p, q) | ∃(p, p′) ∈ Rn+1, (p′, a, q) ∈ δi} (un+1 is extended to the well-matched word un+1a); (3) if a ∈ Σr, then pop (Rn, an) from the stack if a¯n = a, and go to the state

$$R = \{ (p, q) \mid \exists (p, p') \in R_n, (p', a_n, r', \gamma) \in \delta_c, (r', r) \in R_{n+1}, (r, a, \gamma, q) \in \delta_r \}$$

<sup>4</sup>The stack symbol γ is pushed on the left of σ.

<sup>5</sup>The original definition of VPA [3] allows acceptance of ill-matched words.

(the call symbol an is matched with the return symbol a = a¯n, leading to the well-matched word unanun+1a). Finally, the initial state of B is the identity relation on Q<sup>I</sup>, and its final states are the sets R containing some (p, q) with p ∈ Q<sup>I</sup> and q ∈ Q<sup>F</sup>. ⊓⊔

Though a VPL L in general does not have a unique minimal det-VPA accepting L, restricting to the following subclass leads to a unique minimal acceptor.

Definition 2 ([2,9]). A 1-module single entry VPA<sup>6</sup> (1-SEVPA) is a det-VPA A = (Q, Σ˜, Γ, δ, Q<sup>I</sup> = {q0}, Q<sup>F</sup>) such that its stack alphabet Γ is equal to Q × Σc, and all its call transitions (q, a, q′, γ) ∈ δ<sup>c</sup> are such that q′ = q0 and γ = (q, a).

Theorem 2 ([2]). For any VPL L, there exists a unique minimal (with regard to the number of states) 1-SEVPA accepting L, up to a renaming of the states.<sup>7</sup>

Learning VPAs Let us recall the concept of learning a deterministic finite automaton (DFA), as introduced in [4]. Let L be a regular language over an alphabet Σ. The task of the learner is to construct a DFA H such that L(H) = L by interacting with the teacher. The two possible types of interactions are membership queries (does w ∈ Σ<sup>∗</sup> belong to L?) and equivalence queries (does the DFA H accept L?). For the latter type, if the answer is negative, the teacher also provides a counterexample, i.e., a word w such that w ∈ L ⇔ w ∉ L(H). The so-called L<sup>∗</sup> algorithm of [4] learns at least one representative per equivalence class of the Myhill-Nerode congruence of L [8], from which the minimal DFA D accepting L is constructed. This learning process terminates and uses a number of membership and equivalence queries polynomial in the size of D and in the length of the longest counterexample returned by the teacher [4].

In [9], an extension of Angluin's learning algorithm is given for VPLs. The Myhill-Nerode congruence for regular languages is extended to VPLs as follows. Given a pushdown alphabet Σ˜ and a VPL L over Σ˜, we consider the set of context pairs CP(Σ˜) = {(u, v) ∈ (WM(Σ˜) · Σc)<sup>∗</sup> × Suf(WM(Σ˜)) | β(u) = −β(v)}, and we define the equivalence relation ≃L ⊆ WM(Σ˜) × WM(Σ˜) [2,9] such that w ≃L w′ if and only if ∀(u, v) ∈ CP(Σ˜), uwv ∈ L ⇔ uw′v ∈ L. The minimal 1-SEVPA accepting L as described in Theorem 2 is constructed from ≃L such that its states are the equivalence classes of ≃L.

Theorem 3 ([9]). Let L be a VPL over Σ˜ and n be the index of ≃L. Then the minimal 1-SEVPA accepting L can be learned with a number of equivalence queries and a number of membership queries polynomial in n, |Σ|, and log ℓ, where ℓ is the length of the longest counterexample returned by the teacher.

The learning process designed in [9] extends to VPLs the TTT algorithm proposed in [10] for regular languages. TTT improves the efficiency of the L<sup>∗</sup> algorithm by eliminating redundancies in counterexamples provided by the teacher.

<sup>6</sup>The definitions of 1-SEVPA in [2] and [9] differ slightly. We follow the one in [9].

<sup>7</sup>This 1-SEVPA may be exponentially bigger than a smallest VPA accepting L.

## 3 JSON Format

In this section, we describe JSON documents [6] and JSON schemas [13] that impose some constraints on the structure of JSON documents. We also present the abstractions we make for the purpose of this paper.

JSON Documents We describe the structure of JSON documents. Our presentation is inspired by [19], though some details are skipped for readability (see [14] for a full description). The JSON format defines different types of JSON values: primitive values (true, false, null, strings, and numbers), objects (unordered collections of key-value pairs), and arrays (ordered collections of values).


In this work, JSON documents are assumed to be objects.<sup>8</sup> One can use JSON pointers to navigate through a document, e.g., if J is an object and k is a key, then J[k] is the value v such that the key-value pair k : v appears in J.

In this paper, we consider somewhat abstract JSON documents. We see JSON documents as well-matched words over the pushdown alphabet Σ˜ JSON that we describe hereafter. We abstract all string values as s, and all numbers as n (as i when they are integers). We denote by ΣpVal = {true, false, null, s, n, i} the alphabet composed of the six primitive values. Concerning the key-value pairs appearing in objects, each key together with the symbol ":" following it is abstracted as an alphabet symbol k. We assume knowledge of a finite alphabet Σkey of keys. We define the pushdown alphabet Σ˜ JSON = (Σc, Σr, Σi) with Σi = Σkey ∪ ΣpVal ∪ {#}, where # is used in place of the comma; Σc = {≺, ⊏}, where ≺ (resp. ⊏) is used in place of "{" (resp. "["); and Σr = {≻, ⊐}, where ≻ (resp. ⊐) is the return symbol matching ≺ (resp. ⊏). We denote by ΣJSON the set Σc ∪ Σr ∪ Σi.
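To illustrate the abstraction, the following Python sketch maps a parsed JSON value to its abstracted word; the concrete symbol spellings (e.g., "{", "#", and the ("key", k) tuples standing for ≺, # and the key symbols) are our own placeholders, not the paper's notation.

```python
# Sketch of the abstraction of a JSON value into a word over Sigma_JSON,
# returned as a list of symbols. Symbol spellings are illustrative
# placeholders for the paper's alphabet.
def abstract(value):
    if isinstance(value, dict):
        out = ["{"]                       # call symbol opening an object
        for i, (k, v) in enumerate(value.items()):
            if i:
                out.append("#")           # separator replacing the comma
            out.append(("key", k))        # key symbol (key plus ":")
            out.extend(abstract(v))
        out.append("}")
        return out
    if isinstance(value, list):
        out = ["["]                       # call symbol opening an array
        for i, v in enumerate(value):
            if i:
                out.append("#")
            out.extend(abstract(v))
        out.append("]")
        return out
    # bool must be tested before int (bool is a subclass of int in Python)
    if isinstance(value, bool):
        return ["true" if value else "false"]
    if value is None:
        return ["null"]
    if isinstance(value, int):
        return ["i"]                      # integer abstracted as i
    if isinstance(value, float):
        return ["n"]                      # other number abstracted as n
    return ["s"]                          # string abstracted as s
```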

Example 1. An example of a JSON document is given in Listing 1. We can see that this document is an object containing three keys: "title", whose associated value is a string value; "keywords", whose value is an array containing string values; and "conf", whose value is an object. This inner object contains two keys: "name", whose value is a string value; and "year", whose value is an integer. The pointer J[conf][name], where J is the root of the document, retrieves the value "TACAS". The JSON document is abstracted as the word ≺k1s#k2⊏s#s#s⊐#k3≺k4s#k5i≻≻ ∈ WM(Σ˜ JSON), where Σkey contains the keys ki, i ∈ {1, . . . , 5}.

<sup>8</sup> In [6], a JSON document can be any JSON value and duplicated keys are allowed inside objects. In this paper, we follow what is commonly used in practice: JSON documents are objects, and keys are pairwise distinct inside objects.

```
1 { " title ": " Validating Streaming JSON Documents with Learned VPAs ",
2 " keywords ": [" VPA ", " JSON documents ", " streaming validation "],
3 " conf ": { " name ": " TACAS ", " year ": 20 23 }
4 }
```
Listing 1: A JSON document.

```
1 { " type ": " object ",
2 " required ": [" title ", " conf "],
3 " properties ": {
4 " title ": { " type ": " string " },
5 " keywords ": { " type ": " array ", " items ": { " type ": " string " } },
6 " conf ": {
7 " type ": " object ",
8 " required ": [" name ", " year "],
9 " properties ":{ " name ":{" type ": " string "}," year ":{" type ": " integer "}}}}}
```
Listing 2: A JSON schema.

JSON Schemas A JSON schema can impose constraints on JSON documents by specifying the types of the JSON values that appear in those documents. We say that a JSON document satisfies (or is valid for) a schema S if it verifies the constraints imposed by S. We denote by L(S) the set of documents that are valid for S. In this section, we give a simplified presentation of JSON schemas and refer to [13] for a complete description and to [19] for a formalization (i.e., a formal grammar with its syntax and semantics).

A JSON schema is itself a JSON document that uses several keywords that help shape and restrict the set of JSON documents that this schema specifes. As we abstract JSON documents, JSON schemas we work on are also abstracted. We do not consider the restrictions that can be imposed on string values and numbers, for instance. We give here a few examples. See [13] for more details.


Example 2. The schema from Listing 2 describes objects that can have three keys: "title", whose associated value must be a string value; "keywords", an array of strings; and "conf", an object. Among these, "title" and "conf" are required. The JSON document of Example 1 satisfies this JSON schema.

Under these abstractions, we can always construct a VPA that accepts the same set of JSON documents as a schema S, as shown in the following theorem. We also extend this construction to the case where we fix an order < on Σkey and consider the set L<(S) of documents valid for S whose key order inside objects respects <. The main idea of the proof is to define a formalism of JSON schemas as extended context-free grammars, and to show that we can construct a VPA from such a grammar.

Theorem 4. Let S be a JSON schema. Then, there exists a VPA A such that L(A) is the set L(S) of documents valid with regard to S. Moreover, for any order < on Σkey, there exists a VPA B such that L(B) = L<(S).

Our proof does not give a construction of the grammar from the schema S. The grammar depends on the formal semantics of JSON schemas, which is still changing and being debated. Thus, to be more robust to changes in the semantics, we prefer to learn the minimal 1-SEVPA B accepting L<(S) given a fixed order <, in the sense of Theorem 3.<sup>9</sup> For learning, equivalence queries require generating a certain number of random JSON documents.<sup>10</sup> If S and the learner's hypothesis H disagree on a document, we have a counterexample. Otherwise, we say that H is correct. In both membership and equivalence queries, we only accept documents whose key order inside objects satisfies the order <. The randomness used in the equivalence queries implies that the learned 1-SEVPA may not exactly accept L<(S). Setting the number of generated documents to be large helps reduce the probability that an incorrect 1-SEVPA is learned.
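Such an approximate equivalence check can be sketched as follows; all helper names (schema_accepts, hypothesis_accepts, generate_doc) are hypothetical placeholders for the tools described above, not functions from the paper's implementation.

```python
# Sketch of an approximate equivalence oracle: compare the learner's
# hypothesis H with the schema S on generated documents (some of which
# may deliberately deviate from the schema to exercise rejection).
def approximate_equivalence(schema_accepts, hypothesis_accepts,
                            generate_doc, n_tests=1000):
    """Return a counterexample document, or None if no disagreement is found."""
    for _ in range(n_tests):
        doc = generate_doc(allow_invalid=True)
        if schema_accepts(doc) != hypothesis_accepts(doc):
            return doc   # counterexample handed back to the learner
    return None          # hypothesis deemed correct (with some probability)
```

Note that a None answer only means no disagreement was observed, which is why a large n_tests reduces, but does not eliminate, the chance of accepting an incorrect hypothesis.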

## 4 Validation of JSON Documents

For this section, let us fix a schema S, an order < on Σkey, and a 1-SEVPA A = (Q, Σ˜ JSON, Γ, δ, {q0}, Q<sup>F</sup>) accepting L<(S). We present a streaming algorithm to decide whether a document J is in L(S). By "streaming", we mean an algorithm that processes the document in a single pass, symbol by symbol. Our new approach is as follows. We learn A such that L(A) = L<(S). As L<(S) ≠ L(S), we design an algorithm that uses A in a clever way to allow arbitrary key orders in the documents to validate. To do this, we use a key graph, defined in the sequel.

Key Graph In this section, w.l.o.g., we suppose that A has no bin states. Let T<sup>A</sup> be the transition system of A. We explain how to associate to A its key graph G<sup>A</sup>: an abstraction of the paths of T<sup>A</sup> labeled by the contents of the objects appearing in words of L<(S). This graph is essential to our validation algorithm.

Definition 3. The key graph G<sup>A</sup> of A has:

– the vertices (p, k, p′) with p, p′ ∈ Q and k ∈ Σkey if there exists in T<sup>A</sup> a path ⟨p, ε⟩ −kv→ ⟨p′, ε⟩ with v ∈ ΣpVal ∪ {aua¯ | a ∈ Σc, u ∈ WM(Σ˜ JSON)},<sup>11</sup>

<sup>9</sup>We use this automaton in the next section for the validation of JSON documents. We do not use a 1-SEVPA for L(S) as it could be exponentially larger.

<sup>10</sup>It is common to proceed this way in automata learning, as explained in [4, Sec. 4].

<sup>11</sup>Notice that each vertex (p, k, p′) of G<sup>A</sup> only stores the key k and not the word kv.

Fig. 1: A 1-SEVPA for the schema from Listing 2, without the key keywords.

Fig. 2: The key graph for the 1-SEVPA from Figure 1, whose vertices are (q0, title, q2), (q3, conf, q10), (q0, name, q6), and (q7, year, q9).

– the edges ((p1, k1, p′1), (p2, k2, p′2)) if there exists (p′1, #, p2) ∈ δi.

We have the following property.

Lemma 1. There exists a path ((p1, k1, p′1) . . . (pn, kn, p′n)) in G<sup>A</sup> with p1 = q0 if and only if there exist a factor u of a word in L<(S) such that u = k1v1 # . . . # knvn, where each kivi is a key-value pair, and a path ⟨q0, ε⟩ −u→ ⟨p′n, ε⟩ in T<sup>A</sup> that decomposes as ⟨pi, ε⟩ −kivi→ ⟨p′i, ε⟩ for all i ∈ {1, . . . , n} and ⟨p′i, ε⟩ −#→ ⟨pi+1, ε⟩ for all i ∈ {1, . . . , n − 1}. Furthermore, there is no path ((p1, k1, p′1) . . . (pn, kn, p′n)) such that ki = kj for some i ≠ j. That is, G<sup>A</sup> contains a finite number of paths.

Hence, paths in G<sup>A</sup> focus on the contents of objects being part of JSON documents in L<(S). Moreover, they abstract paths in T<sup>A</sup> in the sense that only the keys ki are stored and the subpaths labeled by the values vi are implicit.

Example 3. Consider the schema from Listing 2, without the key keywords. A 1-SEVPA A accepting L<(S) is given in Figure 1. For clarity, call transitions<sup>12</sup> and the bin state are not represented. In Figure 2, we depict its corresponding key graph G<sup>A</sup>. Since we have the path ⟨q0, ε⟩ −title s→ ⟨q2, ε⟩ in T<sup>A</sup>, the triple (q0, title, q2) is a vertex of G<sup>A</sup>. Likewise, (q0, name, q6) and (q7, year, q9) are vertices. As we have the path ⟨q4, ε⟩ −≺→ ⟨q0, (q4, ≺)⟩ −name s # year i→ ⟨q9, (q4, ≺)⟩ −≻→ ⟨q10, ε⟩, (q3, conf, q10) is also a vertex of G<sup>A</sup>. Finally, as ⟨q2, ε⟩ −#→ ⟨q3, ε⟩, we have an edge from (q0, title, q2) to (q3, conf, q10).

Computing the key graph can be done in polynomial time by first computing the reachability relation Reach<sup>A</sup>. From this relation, the vertices can be easily found. Since each edge only requires checking whether a transition reading # exists, the edges can also be computed in polynomial time.
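This construction can be sketched as follows, assuming a hypothetical predicate reads_key_value(p, k, q) (derivable from the reachability relation) that tests whether some path from ⟨p, ε⟩ to ⟨q, ε⟩ reads the key k followed by a single value.

```python
from itertools import product

# Sketch of the key-graph construction. `reads_key_value(p, k, q)` is a
# hypothetical helper derived from the reachability relation; `d_int` is
# the set of internal transitions, given as (state, symbol, state) triples.
def key_graph(states, keys, reads_key_value, d_int, hash_sym="#"):
    # vertices: triples (p, k, p') witnessing a key-value path in T_A
    vertices = [(p, k, q) for p, k, q in product(states, keys, states)
                if reads_key_value(p, k, q)]
    # edges: (p1, k1, p1') -> (p2, k2, p2') iff (p1', #, p2) is a transition
    edges = [(v, w) for v in vertices for w in vertices
             if (v[2], hash_sym, w[0]) in d_int]
    return vertices, edges
```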

Validation Algorithm In this section, we provide a streaming algorithm that validates JSON documents against a given JSON schema S.

Given a word w ∈ Σ<sup>∗</sup> JSON \ {ε}, we want to check whether w ∈ L(S). The main difficulty is that the key-value pairs inside an object are arbitrarily ordered in w, while a fixed key order < is encoded in the 1-SEVPA A (L(A) = L<(S)).

<sup>12</sup>Recall the form of call transitions for 1-SEVPAs, see Definition 2.

Our validation algorithm is inspired by the algorithm computing a det-VPA equivalent to some given VPA [3] (see Theorem 1 and its proof) and uses the key graph G<sup>A</sup> to treat arbitrary orders of the key-value pairs inside objects.

During the reading of w ∈ Σ<sup>∗</sup> JSON \ {ε}, in addition to checking whether w ∈ WM(Σ˜ JSON), the algorithm updates a subset R ⊆ Reach<sup>A</sup> and modifies the content of a stack Stk (push, pop, or modify the element on top of Stk).

First, let us explain the information stored in R. Assume that we have read the prefix zau of w such that a ∈ Σc is the last unmatched call symbol (thus za ∈ (WM(Σ˜ JSON) · Σc)<sup>∗</sup> and u ∈ WM(Σ˜ JSON)).


In the first case, by using R as defined previously, we adopt the same approach as for the determinization of VPAs. In the second case, with u, we are currently reading the key-value pairs of an object in some order, not necessarily the one encoded in A. In this case the set R is focused on the currently read key-value pair knvn, that is, on the word u′. After reading the whole object ≺k1v1 # k2v2 # . . . ≻, we will use the key graph G<sup>A</sup> to update the current set R.

Second, an element stored in the stack Stk is either a pair (R, ⊏) or a 5-tuple (R, ≺, K, k, Bad), where R is a set as described previously, K ⊆ Σkey is a subset of keys, k ∈ Σkey is a key, and Bad is a set containing some vertices of G<sup>A</sup>.<sup>13</sup>

We now detail our streaming validation algorithm.<sup>14</sup> Before reading w, we initialize R to the set I{q0} and Stk to the empty stack. Let us explain how to update the current set R and the current content of the stack Stk while reading the input word w. Suppose that we are reading the symbol a in w. In some cases, we also peek at the symbol b following a (a lookahead of one symbol).

Case (1) Suppose that a is the symbol ⊏, i.e., we start an array. Hence (R, ⊏) is pushed on Stk and R is updated to RUpd = I{q0}. We thus proceed as in the proof of Theorem 1 (with I{q0} instead of IQ, since A is a 1-SEVPA<sup>12</sup>).

Case (2) Suppose that a ∈ Σ<sup>i</sup> and ⊏ appears on top of Stk. We are thus reading the elements of an array. Hence R is updated to RUpd = {(p, q) | ∃(p, q′ ) ∈ R,(q ′ , a, q) ∈ δi}. Again we proceed as in the proof of Theorem 1.

Case (3) Suppose that a is the symbol ⊐. This means that we finished reading an array. If the stack is empty or its top element contains ≺, then w ∉ L(S) and we stop the algorithm. Otherwise (R′, ⊏) is popped from Stk and R is updated to RUpd = {(p, q) | ∃(p, p′) ∈ R′, (p′, ⊏, q0, γ) ∈ δc, (q0, r) ∈ R, (r, ⊐, γ, q) ∈ δr}, as in the proof of Theorem 1.

Case (4) Suppose that a is the symbol ≺.

<sup>13</sup>In the particular case of the object ≺≻, the 5-tuple (R, ≺, K, k, Bad) is replaced by (R, ≺). This situation will be clarified during the presentation of our algorithm.

<sup>14</sup>Note that the algorithm assumes we have a 1-SEVPA.

– Let us first consider the particular case where the symbol b following ≺ is equal to ≻, meaning that we will read the object ≺≻. In this case, (R, ≺) is pushed on Stk and R is updated to RUpd = I{q0} as in Case (1).

– Otherwise, if b belongs to Σkey, we begin to read a (non-empty) object whose treatment is different from that of an array, as its key-value pairs can be read in any order. Then, R is updated to RUpd = IPb where Pb = {p ∈ Q | (p, b, p′) ∈ G<sup>A</sup> for some p′}, and (R, ≺, K, b, Bad) is pushed on Stk such that K is the singleton {b} and Bad is the empty set. The 5-tuple pushed on Stk indicates that the key-value pair that will be read next begins with key b; moreover K = {b} because this is the first pair of the object. The meaning of Bad will be clarified later. The updated set RUpd is equal to the identity relation on Pb since, after reading ≺, we start reading a key-value pair whose abstracted path in G<sup>A</sup> can begin in any state from Pb. Later, while reading the object whose reading has just started, we will update the 5-tuple on top of Stk as explained below.

– Finally, it remains to consider the case where b ∉ Σkey ∪ {≻}. In this final case, we have that w ∉ L(S) and we stop the algorithm.

Case (5) Suppose that a ∈ Σ<sup>i</sup> \ {#} and ≺ appears on top of Stk. Therefore, we are currently reading a key-value pair of an object. Then R is updated to RUpd = {(p, q) | ∃(p, q′) ∈ R, (q′, a, q) ∈ δi}.

Case (6) Suppose that a is the symbol # and ≺ appears on top of Stk. This means that we just finished reading a key-value pair whose key k is stored in the 5-tuple (R′, ≺, K, k, Bad) on top of Stk, and that another key-value pair will be read after the symbol #. The set K in (R′, ≺, K, k, Bad) stores all the keys of the key-value pairs already read, including k.

– If the symbol b following # does not belong to Σkey, then w ∉ L(S) and we stop the algorithm.

– Otherwise, if b belongs to K, this means that the object contains the same key twice, that is, w ∉ L(S), and we also stop the algorithm.

– Otherwise, the set R is updated to RUpd = IPb (as we begin the reading of a new key-value pair whose key is b) and the 5-tuple (R′, ≺, K, k, Bad) on top of Stk is updated such that (i) K is replaced by K ∪ {b}, (ii) k is replaced by b, and (iii) all vertices (p, k, p′) of G<sup>A</sup> such that (p, p′) ∉ R are added to the set Bad. Recall that the vertex (p, k, p′) of G<sup>A</sup> is a witness of a path ⟨p, ε⟩ −kv→ ⟨p′, ε⟩ in T<sup>A</sup> for some key-value pair kv. Hence by adding this vertex (p, k, p′) to Bad, we mean that the pair that has just been read does not use such a path.

Case (7) Suppose that a is the symbol ≻. Therefore we end the reading of an object. If the stack is empty or its top element contains ⊏, then w ∉ L(S) and we stop the algorithm. Otherwise the top of Stk contains either (R′, ≺) or (R′, ≺, K, k, Bad), which we pop from Stk.

– If (R′, ≺) is popped, then we are ending the reading of the object ≺≻. Hence, we proceed as in Case (3): R is updated to RUpd = {(p, q) | ∃(p, p′) ∈ R′, (p′, ≺, q0, γ) ∈ δc, (q0, ≻, γ, q) ∈ δr}.<sup>15</sup>

<sup>15</sup>Notice that R does not appear in RUpd as R = I{q0}.

– If (R′, ≺, K, k, Bad) is popped, we are ending an object whose last seen key is k. As in Case (6), we add to Bad all vertices (p, k, p′) such that (p, p′) ∉ R. Let Valid(K, Bad) be the set of pairs of states (q0, r′) such that there exists a path ((p1, k1, p′1) . . . (pn, kn, p′n)) in G<sup>A</sup> with p1 = q0, p′n = r′, (pi, ki, p′i) ∉ Bad for all i ∈ {1, . . . , n}, and K = {k1, . . . , kn}. Then R is updated to RUpd = {(p, q) | ∃(p, p′) ∈ R′, (p′, ≺, q0, γ) ∈ δc, (q0, r) ∈ Valid(K, Bad), (r, ≻, γ, q) ∈ δr}. We thus proceed as in Case (3), except that the condition (q0, r) ∈ R is replaced by (q0, r) ∈ Valid(K, Bad). That way, we check that the key-value pairs that have been read as composing an object of w label some path in T<sup>A</sup>, once ordered by <. That is, the corresponding abstract path appears in G<sup>A</sup>.

Case (8) Suppose that a ∈ Σ<sup>i</sup> and Stk is empty; then w ∉ L(S) and we stop the algorithm. Indeed, an internal symbol appears either in an array or in an object (see Cases (2), (5), and (6) above).

Finally, when the input word w is completely read, we check whether the stack Stk is empty and the computed set R contains a pair (q0, q) with q ∈ Q<sup>F</sup> .
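The set Valid(K, Bad) used in Case (7) can be computed by a depth-first search over the key graph. Below is a sketch, assuming the key graph is given as an adjacency map in which every vertex (p, k, p′) appears as a key; by Lemma 1, keys never repeat along a path, so the search can stop as soon as the key set K is covered.

```python
# Sketch of Valid(K, Bad): pairs (q0, r') witnessed by a key-graph path
# that starts in q0, avoids the vertices in `bad`, and uses exactly the
# key set K. `graph` maps each vertex (p, k, p') to its successor vertices.
def valid(graph, q0, K, bad):
    K = set(K)
    result = set()

    def dfs(vertex, seen):
        p, k, p2 = vertex
        if vertex in bad or k in seen or k not in K:
            return                      # pruned: forbidden or useless key
        seen = seen | {k}
        if seen == K:
            result.add((q0, p2))
            return                      # keys never repeat (Lemma 1)
        for succ in graph.get(vertex, ()):
            dfs(succ, seen)

    for v in graph:
        if v[0] == q0:                  # paths must start in q0
            dfs(v, set())
    return result
```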

The complexity of our algorithm is given in the following proposition.

Proposition 1. Let S be a schema and A be a 1-SEVPA such that L(A) = L<(S). Deciding whether a document J is valid takes time O(|J| · (|Q|<sup>4</sup> + |Q|<sup>|Σkey|</sup> · |Σkey|<sup>|Σkey|+1</sup>)) and uses O(|δ| + |Q|<sup>2</sup> · |Σkey| + d(J) · (|Q|<sup>2</sup> + |Σkey|)) memory.

## 5 Implementation and Experiments

We present here our Java implementation of the learning process and the validation algorithm. First, we present classical validation algorithms and explain how to generate documents from a schema. We then explain how the required membership and equivalence queries are implemented. Finally, we present the schemas we evaluated, and the results of the learning, key-graph computation, and validation experiments. The reader is referred to the code documentation for more details about our implementation [27–31].

In the remainder of this section, let us assume we are given a JSON schema S_0.

Classical Validation Algorithm and Document Generation Let us briefly explain the classical algorithm used in many implementations for validating a JSON document J_0 against S_0 [13]. It is a recursive algorithm that follows the constraints of S_0. <sup>16</sup> For instance, if the current value J is an object, we iterate over each key-value pair in J and its corresponding sub-schema in the current schema S. Then, J satisfies S if and only if the values in the key-value pairs all satisfy their corresponding sub-schemas. As long as S_0 does not contain any Boolean operations, this algorithm is straightforward and linear in the size of both the initial document J_0 and the schema S_0. However, if S_0 contains Boolean operations, the current value J may be processed multiple times.
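To make the recursion concrete, here is a minimal Python sketch of such a recursive validator (the paper's actual implementation is in Java; the toy schema fragment below, with keys `type`, `properties`, `items` and `allOf`, is ours and only illustrates the idea, not the full JSON Schema language):

```python
def validate(value, schema):
    """Recursively check `value` against a toy schema fragment."""
    t = schema.get("type")
    if t == "object":
        if not isinstance(value, dict):
            return False
        props = schema.get("properties", {})
        # every declared key must be present and its value valid
        return all(k in value and validate(value[k], sub)
                   for k, sub in props.items())
    if t == "array":
        return isinstance(value, list) and all(
            validate(v, schema.get("items", {})) for v in value)
    if t == "string":
        return isinstance(value, str)
    if t == "integer":
        return isinstance(value, int) and not isinstance(value, bool)
    # Boolean combination: the same value is revisited once per operand,
    # which is why the algorithm is no longer linear in general.
    if "allOf" in schema:
        return all(validate(value, sub) for sub in schema["allOf"])
    return True
```

Note how the `allOf` branch revisits the very same `value` for every operand, matching the non-linear behavior described above.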

<sup>16</sup>Such a recursive algorithm is briefly presented in [19].

In order to match the abstractions we defined (see Section 3) and to have options to tune the learning process, we implemented our own classical validator. Alongside the validator, we implemented a tool to generate JSON documents whose structure is dictated by S_0. Due to the Boolean operations S_0 can contain, it may happen that choices must be made during the generation process. We have two generators: a random generator that makes a choice at random, and an exhaustive generator that exhaustively explores every choice, thus producing every valid document one by one. Moreover, we implemented modifications of these generators to allow the creation of invalid documents, by allowing deviations. <sup>17</sup> For instance, if the current schema describes an integer, we can instead decide to generate a string. To ensure we eventually produce a document, we can fix a maximal depth (i.e., the maximal number of nested objects or arrays). This is useful for recursive schemas, or when generating invalid documents.
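A random generator with a deviation option and a maximal depth can be sketched as follows (again a Python illustration over the same toy schema fragment, not the paper's Java tool; the always-deviating `invalid` flag and the empty document emitted at depth 0 are simplifications):

```python
import random

def generate(schema, max_depth, invalid=False, rng=random):
    """Randomly generate a document whose shape follows `schema`.
    `max_depth` bounds nesting so generation terminates on recursive
    schemas; with `invalid=True` we deviate from the schema, e.g. emit
    a string where an integer is expected (akin to mutation testing)."""
    t = schema.get("type")
    if invalid and t == "integer":
        return "oops"                 # deliberate deviation: wrong type
    if t == "integer":
        return rng.randint(0, 9)
    if t == "string":
        return "s"
    if t == "object":
        if max_depth == 0:
            return {}                 # cut off nesting at the depth bound
        return {k: generate(sub, max_depth - 1, invalid, rng)
                for k, sub in schema.get("properties", {}).items()}
    if t == "array":
        if max_depth == 0:
            return []
        return [generate(schema.get("items", {}), max_depth - 1, invalid, rng)]
    return None
```

An exhaustive generator would instead enumerate every branch at each choice point rather than sampling one.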

Learning Algorithm Let us now focus on the learning algorithm itself, and in particular on the membership and equivalence queries. We recall that the equivalence queries are performed by generating a certain number of (valid and invalid) JSON documents and by verifying that the learned VPA H and the given schema S_0 agree on the documents' validity. As stated in Section 2, we use the TTT algorithm [9] to learn a 1-SEVPA from S_0, relying on its implementation in the well-known Java libraries LearnLib and AutomataLib [11].

We use the random and exhaustive generators of valid and invalid documents as explained above, and we fix two constants C and D depending on the schema to be learned.<sup>18</sup> For a membership query over a word w ∈ Σ*_JSON, the teacher runs the classical validator on w and S_0. For an equivalence query over a learned 1-SEVPA H, the teacher uses a generator to produce documents on which H is tested. If that generator is random, at each query, C documents are generated for each document depth between 0 and D. If none of the documents leads to a counterexample, the teacher checks whether G_H violates Lemma 1, i.e., whether there is a path ((p_1, k_1, p′_1) … (p_n, k_n, p′_n)) with p_1 = q_0 such that k_i = k_j for some i ≠ j. In that case, we can create a counterexample.
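The check for a Lemma 1 violation can be sketched as a depth-first search over the key graph. The adjacency encoding below (a vertex (p, k, p′) is followed by the vertices leaving p′) is our simplification of G_H, not the paper's data structure:

```python
def repeated_key_path(key_graph, q0):
    """Search for a path ((p1,k1,p1'), ..., (pn,kn,pn')) starting in q0
    along which some key occurs twice (a Lemma 1 violation).
    `key_graph` maps a state p to the triples (p, k, p') leaving p.
    Returns such a path, or None. Terminates because every recursive
    step adds a fresh key to `seen`."""
    def dfs(state, seen, path):
        for (p, k, p2) in key_graph.get(state, ()):
            if k in seen:                    # key repeats: counterexample material
                return path + [(p, k, p2)]
            found = dfs(p2, seen | {k}, path + [(p, k, p2)])
            if found:
                return found
        return None
    return dfs(q0, frozenset(), [])
```

A returned path can then be turned into an abstracted document exhibiting the disagreement between H and S_0.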

Evaluated Schemas For the experimental evaluation of our algorithms, we consider the following schemas, sorted by increasing size: (1) A schema that accepts documents defined recursively. Each object contains a string and can contain an array whose single element satisfies the whole schema, i.e., this is a recursive list. (2) A schema that accepts documents containing each type of value, i.e., an object, an array, a string, a number, an integer, and a Boolean. (3) A schema that defines how snippets must be described in Visual Studio Code [23]. (4) A recursive schema that defines how the metadata files for VIM plugins must be written [22]. (5) A schema that defines how Azure Functions Proxies files must look [20]. (6) A schema that defines the configuration file

<sup>17</sup>This is similar to mutation testing [1, 12].

<sup>18</sup>The values of C and D are given below.

for a code coverage tool called codecov [21]. Hence, we consider two schemas written by ourselves to test our framework, and four schemas that are used in real-world cases. The last four schemas were modified to make all object keys mandatory and to remove unsupported keywords. All used schemas and scripts can be consulted in our repository [30]. In the rest of this section, the schemas are referred to by their order in the previous enumeration.

We present three types of experimental results: (1) the time and the number of membership and equivalence queries needed to learn a 1-SEVPA A from a JSON schema, (2) the time and memory needed to compute the reachability relation Reach_A and the key graph G_A, and (3) the time and memory needed to validate a document using both the classical and the new algorithm. The server used for the benchmarks ran OpenJDK version 11.0.12 on Debian 10 over Linux 5.4.73-1-pve with a 4-core Intel® Xeon® Silver 4214R processor with 16.5M cache, and 64 GB of RAM.

Learning VPAs First, we learn a 1-SEVPA from a schema. We use an exhaustive generator for the first three schemas (accepting a small set of documents), and a random generator<sup>19</sup> for the remaining three, for which we fix C = 10000. For both generators, we set D = depth(S) + 1, where depth(S) is the maximal number of nested objects and arrays in the schema S, except for the recursive list where D = 10, and for the recursive VIM plugin schema where D = 7.

For the first five schemas, we do not set a time limit and repeat the learning process ten times. For the last schema, we set a time limit of one week and, due to time constraints, perform the learning process only once. After that, we stop the computation and retrieve the learned 1-SEVPA at that point. The retrieved automaton is therefore an approximation of this schema. Its key graph has repeated keys along some of its paths, a situation that could not occur had the 1-SEVPA been correctly learned (see Lemma 1). Results are given in Table 1.

Comparing Validation Algorithms The second part of the preprocessing step is to construct the key graph of the learned 1-SEVPA. For each evaluated schema, we select the learned automaton with the largest set of states, in order to report a worst-case measure. Results after a single experiment are given in Table 2. We can see that the storage of the key graph does not consume more than one megabyte, except for the codecov schema. That is, even for non-trivial schemas, the key graph is relatively lightweight.

Finally, we compare the classical and the new streaming validation algorithms. For the latter, we use the 1-SEVPA (and its key graph) selected as described above. We first generate 5000 valid and 5000 invalid JSON documents using a random generator, with a maximal depth equal to D = 20. We then measure the time and memory required by both validation algorithms on these documents.<sup>20</sup>

<sup>19</sup>With the random generator, the learned 1-SEVPAs may differ between experiments.

<sup>20</sup>Since obtaining a close approximation of the consumed memory requires Java to stop the execution and destroy all unused objects, we execute each algorithm twice: once to measure time, and a second time to measure memory.


Table 1: Learning results. For the first five schemas, values are averaged over ten experiments. For the last schema, a single experiment was conducted.


Table 2: Results for the computation of Reach_A and G_A. The Computation (resp. Storage) column gives the memory required to compute G_A (resp. to store G_A).

Fig. 3: Results of validation benchmarks. The panels include (a) time usage (ms) for VIM plugins, (b) memory usage (kB) for VIM plugins, and (e) time usage for codecov.

On all considered documents, both algorithms return the same classification output, even for the partially learned 1-SEVPA.

For our algorithm, we only measure the memory required to execute the algorithm, as we do not need to store the whole document to be able to process it. We also do not count the memory needed to store the 1-SEVPA and its key graph. As the classical algorithm must keep the complete document in memory, we sum the RAM consumption for the document and for the algorithm itself. This is consistent with what happens in actual web-service handling: whenever a new validation request is received, a new subprocess is spawned to handle that specific document. Since the 1-SEVPA and its key graph are the same for all subprocesses, they would be loaded in a memory space shared by all processes.

Experimental results indicate that our algorithm exhibits good performance. Results for the three smaller schemas are omitted here to save space; Figure 3 gives the results for VIM plugins, Azure Functions Proxies, and codecov. The blue crosses (resp. red circles) give the values for our (resp. the classical) algorithm. The x-axis gives the size of each (abstracted) document.

For both VIM plugins and Azure Functions Proxies, our algorithm consumes less memory than the classical one. For these benchmarks, memory and time usage seemingly trade off, as our algorithm usually requires more time to validate a document; a majority of that time is spent computing the set Valid(K, Bad). This trade-off, however, does not hold in general: there are schemas for which our algorithm performs better than the classical one in terms of both time and memory, as it does not have to backtrack to validate a document, which reduces the time and memory space required.

For the codecov schema, we recall that the learning process was not completed, leading to an approximate 1-SEVPA with repeated keys in its key graph. This means that the computation of Valid(K, Bad) explores some invalid paths, increasing the memory and time consumed by our algorithm. Thus, while an incompletely learned 1-SEVPA can still be used in our algorithm, stopping the learning process early may increase the time and space required.

## 6 Future Work

As future work, one could focus on constructing the VPA directly from the schema, without going through a learning algorithm. While this task is easy if the schema does not contain Boolean operations, it is not yet clear how to proceed in the general case. Second, it could be worthwhile to compare our algorithm against an implementation of a classical algorithm used in industry. This would require either modifying the industrial implementations to support abstractions, or modifying our algorithm to work on unabstracted JSON schemas. Third, in our validation approach, we decided to use a VPA accepting the JSON documents satisfying a fixed key order, thus requiring the key graph and its costly computation of the set Valid(K, Bad). It could be interesting to carry out additional experiments comparing this approach with one where we instead use a VPA accepting the JSON documents and all their key permutations; in this case, reasoning on the key graph would no longer be needed. Finally, motivated by obtaining efficient querying algorithms on XML trees, the authors of [26] introduced the concept of mixed automata, which accept subsets of unranked trees where some nodes have ordered sons and others have unordered sons. It would be interesting to adapt our validation algorithm to different document formalisms, such as that of mixed automata.

Data-Availability Statement. The source code and experimental results that support the findings of this study are available on Zenodo with the identifier https://doi.org/10.5281/zenodo.7309690 [31].

## References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Antichains Algorithms for the Inclusion Problem Between** *ω***-VPL**

Kyveli Doveri<sup>1,2</sup>(✉), Pierre Ganty<sup>1</sup>, and Luka Hadži-Đokić<sup>1</sup>

<sup>1</sup> IMDEA Software Institute, Madrid, Spain {kyveli.doveri,pierre.ganty,luka.hadzi-dokic}@imdea.org <sup>2</sup> Universidad Politécnica de Madrid, Madrid, Spain

**Abstract.** We define novel algorithms for the inclusion problem between two visibly pushdown languages of infinite words, an EXPTime-complete problem. Our algorithms search for counterexamples to inclusion in the form of ultimately periodic words, i.e. words of the form uv<sup>ω</sup> where u and v are finite words. They are parameterized by a pair of quasiorders telling which ultimately periodic words need not be tested as counterexamples to inclusion without compromising completeness. The pair of quasiorders enables distinct reasoning for prefixes and periods of ultimately periodic words, thereby allowing even more words to be discarded than when using the same quasiorder for both. We put forward two families of quasiorders: the state-based quasiorders, based on automata, and the syntactic quasiorders, based on languages. We also implemented our algorithm and conducted an empirical evaluation on benchmarks from software verification.

## **1 Introduction**

Visibly pushdown languages [4] (VPL) have applications in various domains including verification [22], theorem proving [27], and reasoning about XML schema languages [26], where the inclusion problem plays a crucial role. For instance, proving correctness relative to a specification reduces to a language inclusion problem, and so does proving correctness of a theorem of the form ∀x∃y P(x) ⟹ Q(y). The extension to the case of visibly pushdown languages of infinite words (ω-VPL) has also been studied in the context of program verification [21] and has applications in word combinatorics [23,25,27].

We distinguish two general approaches to solve the language inclusion problem L ⊆ M: (i) complement M, intersect with L and check for emptiness of the result; and (ii) reduce the inclusion check to finitely many membership queries asking whether w ∈ M holds where w ∈ L and each query aims at finding a counterexample to inclusion.

© The Author(s) 2023

This work was partially funded by the ESF Investing in your future, the RYC-2016- 20281/MCIN/AEI/10.13039/501100011033, the Madrid regional government as part of the program S2018/TCS-4339 (BLOQUES-CM) co-funded by EIE Funds of the European Union, the PRODIGY Project (TED2021-132464B-I00) funded by MCIN and the European Union NextGenerationEU/ PRTR.

S. Sankaranarayanan and N. Sharygina (Eds.): TACAS 2023, LNCS 13993, pp. 290–307, 2023. https://doi.org/10.1007/978-3-031-30823-9_15

In this paper we focus on the second approach. Previous work in that space leverages relations between words to select a finite subset of words of L on which to run the membership queries. A class of relations that consistently yields good results in practice are quasiorders, which discard words subsumed (for the quasiorder) by others. A key feature of such quasiorders is that the subset of L selected via the quasiorder must contain a counterexample to inclusion if one exists. Quasiorders are a versatile heuristic that has been applied to inclusion problems for languages of finite words [3,10,14] (including visibly pushdown languages [6]), infinite words [1,2,12,13,16,24], and even tree languages [3,5]. Algorithms leveraging quasiorders are commonly referred to as antichains algorithms. Subsequent improvements (e.g. [2] improving [1]) often attempt to define coarser quasiorders because they enable the selection of an even smaller subset of L.

Let us now turn to the inclusion problem between ω-VPL, an EXPTime-complete problem. For that problem the selection of words of L is limited to ultimately periodic words, i.e. words of the form uv<sup>ω</sup>, where u and v are called prefix and period respectively. For an ultimately periodic word uv<sup>ω</sup>, subsumption (for a quasiorder) simply means subsumption of (u, v) relative to a pair ≼₁ × ≼₂ of quasiorders on finite words. The quasiorders found in the literature [17,18] are all equivalences and all satisfy ≼₁ = ≼₂.

In this paper, we propose a new family of algorithms for the inclusion problem between ω-VPL that leverages a subset of the ultimately periodic words, deemed legitimate decompositions, and is parameterized by a pair of quasiorders and a decision procedure for the membership queries in M. We identify properties that such a pair of quasiorders must satisfy so that the resulting algorithm actually decides the inclusion problem between two ω-VPL: (1) be decidable; (2) be well-quasiorders; (3) verify some monotonicity conditions w.r.t. word operations that are characteristic of ω-VPL; and (4) satisfy a preservation property intuitively saying that a legitimate decomposition inside M cannot subsume a legitimate decomposition outside of M. We put forward two families of quasiorders satisfying (1) through (4): the state-based quasiorders, whose definition relies on a visibly pushdown automaton underlying M, and the syntactic quasiorders, whose definition is based solely on M. The syntactic orders are the "ideal" quasiorders in the sense that they are the coarsest, hence they select the "smallest" subset of L. None of our quasiorders is symmetric, hence they are coarser than equivalences, and in each and every pair we define, the quasiorder on prefixes differs from the one on periods (i.e. ≼₁ ≠ ≼₂). We further prove that when instantiated with the state-based quasiorders and with a state-based decision procedure for membership queries, the resulting algorithm, which we call the state-based algorithm, has a runtime that matches the corresponding problem complexity.

Finally we implement the state-based algorithm and evaluate it on various benchmarks collected from Friedmann et al. [18] and from SV-COMP<sup>3</sup>, the Software Verification competition. The empirical evaluation is carried out against Ultimate [21] which follows a complement, intersect and check for emptiness

<sup>3</sup> https://sv-comp.sosy-lab.org

approach. The preliminary conclusion of the empirical results is in favor of our approach as it scales up better.

Related Work. Bruyère et al. [6] proposed an antichain algorithm for the inclusion of VPL but they only tackle the problem for languages of finite words. The same limitation applies to Ganty et al. [19,20], where, moreover, they do not tackle the inclusion problem of VPL into VPL (the closest they tackle is CFL into regular). The extension from the finite to the infinite case was tackled by Doveri et al. [13], but they do not cover the case of ω-VPL into ω-VPL (the closest they tackle is ω-CFL into ω-regular). Friedmann et al. [17,18] do tackle the ω-VPL into ω-VPL problem. However, they do not leverage the full power of quasiorders (they use equivalences instead); they do not use distinct pruning techniques for prefixes and periods; and they do not put forward syntactic quasiorders. A summary comparing our work (omegaVPLinc) with the closest works in the area is given in Table 1.

**Table 1.** Comparison of the closest work in the area based on the characteristics of the problem tackled (first two columns) and the techniques used (last three columns). N/A means not applicable, ✗ means no support and ✓ means full support. The labels ω, VPL, qo, ≼₁ ≠ ≼₂ and syntactic qo ask respectively whether the work thereof tackles the problem of infinite words, tackles the problem of VPL, leverages quasiorders, defines distinct quasiorders for prefixes and periods, and defines syntactic quasiorders.

Rows of Table 1: Bruyère et al. [6] (N/A for ≼₁ ≠ ≼₂), Ganty et al. [20] (N/A for ≼₁ ≠ ≼₂), Doveri et al. [13], Friedmann et al. [18], and omegaVPLinc; the individual support marks are discussed in the Related Work paragraph above.


## **2 Background**

Fix Σ ≜ Σ_i ∪ Σ_c ∪ Σ_r, an alphabet (a finite non-empty set of symbols) comprising three disjoint alphabets. The set of finite words and the set of infinite words over Σ are denoted by Σ* and Σ^ω respectively. We denote by ε the empty word and define Σ⁺ ≜ Σ*\{ε}. Given a word u = u_0u_1 ··· ∈ Σ* ∪ Σ^ω, we say that a position j, where j ∈ ℕ, j < |u|, and |u| ∈ ℕ ∪ {ω} is the length of u, is an internal (resp. call, resp. return) position if u_j ∈ Σ_i (resp. u_j ∈ Σ_c, resp. u_j ∈ Σ_r).

**Visibly Pushdown Languages.** A Visibly Pushdown Automaton (VPA) over Σ is a tuple A = (Q, q_I, Γ, δ, F), where Q is a finite set of states including an initial state q_I ∈ Q, F ⊆ Q is the set of final states, Γ is the stack alphabet including a bottom-of-stack symbol ⊥, and δ = δ_i ∪ δ_c ∪ δ_r consists of three transition relations δ_i ⊆ Q × Σ_i × Q, δ_c ⊆ Q × Σ_c × Q × Γ\{⊥} and δ_r ⊆ Q × Σ_r × Γ × Q. Configurations of A are pairs in Q × Γ*. For a ∈ Σ we define the relation →_a between configurations as follows: an internal transition (q, a, q′) ∈ δ_i yields (q, w) →_a (q′, w); a call transition (q, a, q′, γ) ∈ δ_c pushes γ, yielding (q, w) →_a (q′, γw); and a return transition (q, a, γ, q′) ∈ δ_r pops γ, yielding (q, γw) →_a (q′, w), where the bottom symbol is read but not popped, i.e. (q, ⊥) →_a (q′, ⊥) when γ = ⊥.


We lift the relation to words by transitivity and reflexivity: for all u ∈ Σ*, (q, w) →*_u (p, w′) when the configurations (q, w) and (p, w′) are related by a sequence of transitions such that the concatenation of the corresponding labels is the word u. We write (q, w) ⇝_u (p, w′) when such a sequence includes a configuration whose state is final. A trace of A on an infinite word ξ = a_0a_1 ··· ∈ Σ^ω is an infinite sequence (q_0, w_0) →_{a_0} (q_1, w_1) →_{a_1} ···. It is a final trace when q_j ∈ F for infinitely many j's. It is an accepting trace when it is a final trace and (q_0, w_0) = (q_I, ⊥). The ω-language accepted by A is L^ω(A) ≜ {ξ ∈ Σ^ω | there is an accepting trace of A on ξ}. A language L ⊆ Σ^ω is an ω-VPL if L = L^ω(A) for some VPA A. Two examples of VPA are given in Fig. 1: A has an accepting trace on c cr cr cr … and so does B on crr crr …

**Fig. 1.** Two ω-VPA with Γ = {A, ⊥}, Σ<sup>i</sup> = ∅, Σ<sup>c</sup> = {c} and Σ<sup>r</sup> = {r}.
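As an illustration, a single-state VPA over Σ_c = {c} and Σ_r = {r} can be encoded as follows (a Python sketch with deterministic transition maps for brevity, whereas the paper's VPAs are nondeterministic; this is not claimed to be either automaton of Fig. 1, and the finite-word runner below ignores the Büchi acceptance condition):

```python
from dataclasses import dataclass, field

BOTTOM = "⊥"

@dataclass
class VPA:
    """A visibly pushdown automaton A = (Q, q_I, Γ, δ, F); δ is split into
    internal, call and return transitions as in the definition above."""
    q_init: int
    finals: frozenset                              # used by the Büchi condition
    internal: dict = field(default_factory=dict)   # (q, a)    -> q'
    call: dict = field(default_factory=dict)       # (q, a)    -> (q', γ)  pushes γ
    ret: dict = field(default_factory=dict)        # (q, a, γ) -> q'       pops γ

    def step(self, config, a):
        q, stack = config
        if (q, a) in self.internal:
            return (self.internal[(q, a)], stack)
        if (q, a) in self.call:
            q2, g = self.call[(q, a)]
            return (q2, (g,) + stack)
        g = stack[0] if stack else BOTTOM
        if (q, a, g) in self.ret:
            # on ⊥ the bottom symbol stays; otherwise γ is popped
            return (self.ret[(q, a, g)], stack if g == BOTTOM else stack[1:])
        return None                                # no enabled transition

    def run(self, word):
        config = (self.q_init, ())                 # () encodes the stack ⊥
        for a in word:
            config = self.step(config, a)
            if config is None:
                return None
        return config
```

The visible partition shows up in `step`: the symbol alone decides whether the stack is pushed, popped, or left untouched.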

**Ultimately Periodic Words.** An ultimately periodic word is an infinite word ξ ∈ Σ^ω such that ξ = uv^ω for some finite prefix u ∈ Σ* and some finite period v ∈ Σ⁺. We call the couple (u, v) ∈ Σ* × Σ⁺ a decomposition of ξ. Note that ξ admits infinitely many decompositions.
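For instance, shifting the prefix boundary, doubling the period, or rotating the period all yield distinct decompositions of the same word, which a short sketch can check on finite prefixes (`expand` and `prefix_of_word` are our illustrative helpers, not notions from the paper):

```python
def expand(u, v):
    """A few of the infinitely many decompositions of uv^ω: extend the
    prefix by one period, double the period, or rotate the period after
    moving its first letters into the prefix."""
    decs = [(u, v), (u + v, v), (u, v + v)]
    for i in range(1, len(v)):
        decs.append((u + v[:i], v[i:] + v[:i]))   # rotate the period
    return decs

def prefix_of_word(u, v, n):
    """The first n letters of the infinite word uv^ω."""
    s = u
    while len(s) < n:
        s += v
    return s[:n]
```

All decompositions produced by `expand` denote the same infinite word, as comparing long finite prefixes confirms.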

Ultimately periodic words play a central role in our approach as they suffice for the inclusion problem as shown by the following theorem. <sup>4</sup>

## **Theorem 1.** Let L, M ⊆ Σ^ω be ω-VPL. Then, L ⊆ M iff ∀uv^ω ∈ L, uv^ω ∈ M.

**Matching Relation.** The partition of the alphabet Σ = Σ_i ∪ Σ_c ∪ Σ_r induces a unique matching relation between a word's call and return positions (see [18]). Given u ∈ Σ* ∪ Σ^ω, define the matching relation of u, denoted ↷_u, as the unique relation on its call and return positions such that for every j ↷_u k we have 0 ≤ j < k < |u|, u_j ∈ Σ_c, u_k ∈ Σ_r, |{n | j ↷_u n}| ≤ 1, |{n | n ↷_u k}| ≤ 1, and there are no j′, k′ with j′ ↷_u k′ and j < j′ < k < k′. Given j ↷_u k we say that j and k are matched positions. A call (resp. return) position j in u is unmatched

<sup>4</sup> Theorem 1 can be easily obtained by adapting the proof of Fact 1 in [7].

if j ↷_u k (resp. k ↷_u j) for no k. Furthermore, for every unmatched position n in u there is no j ↷_u k such that j < n < k, and if u_n ∈ Σ_c (resp. u_n ∈ Σ_r) then there is no unmatched return (resp. call) position k with n < k (resp. k < n). A word is said to be well-matched if it has no unmatched position.
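The unique matching relation of a finite word can be computed in a single left-to-right pass with a stack of pending call positions, which also yields the unmatched positions (a Python sketch; encoding the alphabet partition as two sets is our convention):

```python
def matching(word, calls, returns):
    """Compute the matching relation of `word` as a set of position pairs
    (j, k), together with the unmatched call and return positions.
    One stack pass realizes the unique non-crossing matching."""
    stack, matched, unmatched_returns = [], set(), []
    for k, a in enumerate(word):
        if a in calls:
            stack.append(k)                   # pending call position
        elif a in returns:
            if stack:
                matched.add((stack.pop(), k)) # innermost pending call matches
            else:
                unmatched_returns.append(k)
        # internal symbols take part in no matching
    return matched, stack, unmatched_returns  # stack = unmatched calls

def well_matched(word, calls, returns):
    _, uc, ur = matching(word, calls, returns)
    return not uc and not ur
```

Note that, as stated above, unmatched returns can only precede unmatched calls: the stack is empty exactly until the last unmatched return.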

## **3 Foundations**

In this section we outline our approach which, given a VPA A = (Q, q_I, Γ, δ, F) and an ω-VPL M, reduces the inclusion problem L^ω(A) ⊆ M to finitely many membership queries in M. More precisely, we derive a finite subset S_finite of ultimately periodic words of L^ω(A) such that

$$L^{\omega}(\mathcal{A}) \subseteq M \iff \forall (u,v) \in S\_{\text{finite}}, \; uv^{\omega} \in M \; . \tag{\dagger}$$

**Reduction to Legitimate Decompositions.** Our first step is to reduce the inclusion check to a subset of the ultimately periodic words of L^ω(A) given by legitimate decompositions. To do so, we define W as the set of well-matched finite words, C (resp. R) as the set of finite words where all call (resp. return) positions are matched, and U_c as the set of finite words with at least one unmatched call position. In turn, we define the set of legitimate decompositions given by

$$\mathsf{Ld} \triangleq \mathsf{C} \times \mathsf{C} \cup \mathsf{U}\_{\mathsf{c}} \times \mathsf{R}$$

which, as shown next, is sufficient for the inclusion problem between ω-VPL.

**Theorem 2.** Let L, M ⊆ Σ^ω be ω-VPL. Then, L ⊆ M iff ∀(u, v) ∈ Ld, uv^ω ∈ L ⟹ uv^ω ∈ M.

Next we leverage the relations →* and ⇝ of A to characterize the legitimate decompositions of the ultimately periodic words of L^ω(A). We start by defining the following languages of finite words for each pair p, q ∈ Q of states of A: L_{p,q} ≜ {u ∈ Σ* | ∃w ∈ Γ*, (p, ⊥) →*_u (q, w)} and L^⊗_{p,q} ≜ {u ∈ Σ⁺ | ∃w ∈ Γ*, (p, ⊥) ⇝_u (q, w)}. Finally, define the following subset of Ld:

$$S \triangleq \bigcup\_{p \in Q} L\_{q\_I,p}|\_{\mathsf{C}} \times L^{\otimes}\_{p,p}|\_{\mathsf{C}} \;\cup\; L\_{q\_I,p}|\_{\mathsf{U\_c}} \times L^{\otimes}\_{p,p}|\_{\mathsf{R}}$$

where L|_K is defined to be L ∩ K to emphasize that L is restricted to K.

Example 1. Consider the VPA A and B depicted in Fig. 1. We have L^ω(A) = R^ω, S = (W × (W\{ε})) ∪ ((R\C) × (R\{ε})) and L^ω(B) = ((W\{ε})r)^ω.

**Proposition 1.** We have that uv^ω ∈ L^ω(A) ⟺ ∃(u′, v′) ∈ S, uv^ω = u′v′^ω.

By Theorem 2 and Proposition 1 the subset S verifies:

$$L^{\omega}(\mathcal{A}) \subseteq M \iff \forall (u,v) \in S, \; uv^{\omega} \in M \; . \tag{1}$$

Next we reduce the inclusion check to a finite subset of S using quasiorders.

**Reduction to a Finite Basis.** A quasiorder (qo) on a set E is a reflexive and transitive relation ≼ ⊆ E × E. Given two subsets X, Y ⊆ E, the set Y is said to be a basis for X with respect to ≼ whenever Y ⊆ X and ∀x ∈ X, ∃y ∈ Y, y ≼ x. A qo is a well-quasiorder (wqo) if every subset of E admits a finite basis.
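The pruning behind antichains algorithms can be sketched as computing a basis of minimal elements of a finite set under a given quasiorder; the (scattered) subword order below is a classic wqo used only as an example, not one of the paper's quasiorders:

```python
def basis(xs, leq):
    """Return a minimal basis of the finite iterable `xs` for the
    quasiorder `leq`: a subset Y of xs such that every x in xs has some
    y in Y with leq(y, x). Keeping only minimal elements is how
    antichains algorithms prune the words submitted to membership
    queries."""
    ys = []
    for x in xs:
        if any(leq(y, x) for y in ys):
            continue                           # x is subsumed, discard it
        # x may in turn subsume previously kept elements
        ys = [y for y in ys if not leq(x, y)] + [x]
    return ys

def subword(u, v):
    """Scattered-subword order: u embeds into v preserving letter order.
    A classic well-quasiorder on finite words (Higman)."""
    it = iter(v)
    return all(c in it for c in u)
```

For example, under `subword` the set {"ab", "aab", "b", "abb"} has the singleton basis {"b"}, since "b" is a subword of every element.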

We obtain S_finite as a finite basis for S with respect to ≼₁ × ≼₂ for a pair ≼₁, ≼₂ of wqos.<sup>5</sup> To guarantee the direction ⇐ in Eq. (†) we need the pair ≼₁, ≼₂ to be M-preserving, a notion we introduce below.

A pair ≼₁, ≼₂ of qos on Σ* is said to be M-preserving if for all (u, v), (u′, v′) ∈ Ld such that (u, v), (u′, v′) ∈ C × C or (u, v), (u′, v′) ∈ U_c × R,

$$\text{if } uv^{\omega} \in M,\ u \preccurlyeq\_1 u' \text{ and } v \preccurlyeq\_2 v' \text{ then } u'v'^{\omega} \in M\,.$$

Intuitively, M-preservation guarantees that if the inclusion does not hold then the finite basis Sfinite contains a counterexample.

Next, we fix a pair of M-preserving wqos ≼₁, ≼₂ and show the existence of a subset S_finite such that Eq. (†) holds. Since ≼₁ × ≼₂ is a wqo, there exist two finite bases S₁ and S₂ for S|_{C×C} and S|_{U_c×R} respectively w.r.t. ≼₁ × ≼₂. We define S_finite to be the union of these sets, viz. S_finite ≜ S₁ ∪ S₂ ⊆ S. We have that: ∀(u, v) ∈ S, uv^ω ∈ M ⟹ ∀(u, v) ∈ S_finite, uv^ω ∈ M. We now turn to the converse implication. Assume that ∀(u, v) ∈ S_finite, uv^ω ∈ M. Let (u, v) ∈ S. If (u, v) ∈ S|_{C×C} then there is (u₀, v₀) ∈ S₁ such that (u₀, v₀) ≼₁ × ≼₂ (u, v). Since S₁ ⊆ S|_{C×C} ⊆ C × C we have that (u₀, v₀), (u, v) ∈ C × C. Since u₀v₀^ω ∈ M and the pair ≼₁, ≼₂ is M-preserving, we conclude that uv^ω ∈ M. The case (u, v) ∈ S|_{U_c×R} proceeds analogously. It follows that ∀(u, v) ∈ S, uv^ω ∈ M ⟸ ∀(u, v) ∈ S_finite, uv^ω ∈ M. Hence, we derive Equation (†) using Equation (1).

In Section 4, we give a fixpoint characterization of S, and in Section 5 we show that under some monotonicity conditions on the wqos ≼₁ and ≼₂ we can effectively compute a finite basis for S. We then give two examples of monotonic pairs of wqos in Section 6. In Section 7 we present our algorithm which, given two VPA A and B, decides the inclusion problem L^ω(A) ⊆ L^ω(B). Therein we discuss the state-based algorithm and give an upper bound on its running time. Finally, in Section 8 we report on an empirical evaluation.

## **4 Fixpoint Characterization**

In this section we give a least fixpoint characterization of S for the VPA A = (Q, q<sub>I</sub>, Γ, δ, F). To this end we work with the complete lattice (℘(Σ<sup>*</sup>)<sup>n·|Q|²</sup>, ⊆×···×⊆), where n ∈ {4, 6} and each Cartesian product consists of n·|Q|² factors.

For a function f : E → E on a quasiordered set (E, ⩽) and for all n ∈ ℕ, we define the n-th iterate f<sup>n</sup> : E → E of f inductively as follows: f<sup>0</sup> ≜ λx.x; f<sup>n+1</sup> ≜ f ∘ f<sup>n</sup>. The denumerable sequence of Kleene iterates of f starting from the bottom value ⊥ ∈ E is given by {f<sup>n</sup>(⊥)}<sub>n∈ℕ</sub>. Recall that when (E, ⩽) is a complete lattice and f : E → E is a monotone function (i.e. d ⩽ d′ ⟹ f(d) ⩽ f(d′)) then, by the Knaster–Tarski theorem, f has a least fixpoint lfp f given by the supremum of the ascending<sup>6</sup> sequence of Kleene iterates of f.

<sup>5</sup> The qo ⩽×≼ is a wqo when both ⩽ and ≼ are wqos.
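The generic Kleene-iteration scheme just recalled can be sketched in a few lines of Python. This is purely illustrative (the paper's iterates live over tuples of sets of words); the toy function `f` below is an assumption chosen so that the iterates stabilize quickly.

```python
# Kleene iteration to the least fixpoint of a monotone function on a
# lattice of tuples of finite sets: iterate f from the bottom value until
# the sequence stabilizes; the limit is the supremum of the ascending chain.
def lfp(f, bottom):
    x = bottom
    while True:
        y = f(x)
        if y == x:
            return x
        x = y

# toy monotone function: least solution of X = {0} ∪ {n+1 | n ∈ X, n < 3}
f = lambda x: (frozenset({0}) | frozenset(n + 1 for n in x[0] if n < 3),)
print(lfp(f, (frozenset(),)))  # single component containing {0, 1, 2, 3}
```

In the paper's setting the equality test is replaced by a subsumption check w.r.t. a wqo, which is what guarantees termination (Section 5).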

Given an n·|Q|²-dimensional vector X and a |Q|²-dimensional vector Y over ℘(Σ<sup>*</sup>), we write X<sub>i,p,q</sub> for the (i, p, q)-component of X and Y<sub>p,q</sub> for the (p, q)-component of Y. We define the following equations, where X, X′ ∈ ℘(W)<sup>|Q|²</sup>, Y, Y′ ∈ ℘(C)<sup>|Q|²</sup>, Z, Z′ ∈ ℘(R)<sup>|Q|²</sup>, and T ∈ ℘(U<sub>c</sub>)<sup>|Q|²</sup>:

$$\begin{aligned} W(X) &= \Big( L\_{p,q}|\_{\Sigma\_i \cup \{\epsilon\}} \cup \bigcup\_{(p,c,p',\gamma) \in \delta\_c,\, (q',r,\gamma,q) \in \delta\_r} c\, X\_{p',q'}\, r \cup \bigcup\_{q' \in Q} X\_{p,q'} X\_{q',q} \Big)\_{p,q \in Q} \\ C(X, Y) &= \Big( L\_{p,q}|\_{\Sigma\_r} \cup X\_{p,q} \cup \bigcup\_{q' \in Q} Y\_{p,q'} Y\_{q',q} \Big)\_{p,q \in Q} \\ R(X, Z) &= \Big( L\_{p,q}|\_{\Sigma\_c} \cup X\_{p,q} \cup \bigcup\_{q' \in Q} Z\_{p,q'} Z\_{q',q} \Big)\_{p,q \in Q} \\ U(Y, Z, T) &= \Big( L\_{p,q}|\_{\Sigma\_c} \cup \bigcup\_{p',q' \in Q} Y\_{p,p'}\, T\_{p',q'}\, Z\_{q',q} \Big)\_{p,q \in Q} \\ \overline{W}(X, X') &= \Big( \overline{L}\_{p,q}|\_{\Sigma\_i} \cup \bigcup\_{(p,c,p',\gamma) \in \delta\_c,\, (q',r,\gamma,q) \in \delta\_r,\, \{p,q\} \cap F \neq \emptyset} c\, X\_{p',q'}\, r \cup \bigcup\_{(p,c,p',\gamma) \in \delta\_c,\, (q',r,\gamma,q) \in \delta\_r} c\, X'\_{p',q'}\, r \cup \bigcup\_{q' \in Q} \big( X'\_{p,q'} X\_{q',q} \cup X\_{p,q'} X'\_{q',q} \big) \Big)\_{p,q \in Q} \\ \overline{C}(X', Y, Y') &= \Big( \overline{L}\_{p,q}|\_{\Sigma\_r} \cup X'\_{p,q} \cup \bigcup\_{q' \in Q} \big( Y'\_{p,q'} Y\_{q',q} \cup Y\_{p,q'} Y'\_{q',q} \big) \Big)\_{p,q \in Q} \\ \overline{R}(X', Z, Z') &= \Big( \overline{L}\_{p,q}|\_{\Sigma\_c} \cup X'\_{p,q} \cup \bigcup\_{q' \in Q} \big( Z'\_{p,q'} Z\_{q',q} \cup Z\_{p,q'} Z'\_{q',q} \big) \Big)\_{p,q \in Q} \ . \end{aligned}$$

The equations W, C, R and U are used to obtain the sets of words in W, C, R and U<sub>c</sub> respectively that connect two configurations of A. The equations W̄, C̄ and R̄ refine those of W, C and R by filtering out words whose runs do not visit a final state. In turn, we define the functions f<sub>A</sub> and r<sub>A</sub> used to obtain the prefixes u and the periods v, respectively, of the decompositions (u, v) ∈ S. Define

$$\begin{aligned} f\_{\mathcal{A}} \colon \wp(\Sigma^\*)^{4 \cdot |Q|^2} &\longrightarrow \wp(\Sigma^\*)^{4 \cdot |Q|^2} \\ (X, Y, Z, T) &\longmapsto (W(X), C(X, Y), R(X, Z), U(Y, Z, T)) \end{aligned}$$

for the prefixes, and for the periods define

$$\begin{aligned} r\_{\mathcal{A}} \colon \wp(\Sigma^\*)^{6 \cdot |Q|^2} &\longrightarrow \wp(\Sigma^\*)^{6 \cdot |Q|^2} \\ (X, Y, Z, X', Y', Z') &\longmapsto \left( W(X), C(X, Y), R(X, Z), \overline{W}(X, X'), \overline{C}(X', Y, Y'), \overline{R}(X', Z, Z') \right) \ . \end{aligned}$$

The function f<sub>A</sub> (resp. r<sub>A</sub>) is monotone, and the supremum of the ascending sequence of its Kleene iterates starting at the bottom value ∅⃗ ≜ (∅, …, ∅) of dimension 4·|Q|² (resp. 6·|Q|²) is the vector (Λ|<sub>W</sub>, Λ|<sub>C</sub>, Λ|<sub>R</sub>, Λ|<sub>U<sub>c</sub></sub>) (resp. (Λ|<sub>W</sub>, Λ|<sub>C</sub>, Λ|<sub>R</sub>, Λ̄|<sub>W</sub>, Λ̄|<sub>C</sub>, Λ̄|<sub>R</sub>)), where Λ|<sub>J</sub> ≜ (L<sub>p,q</sub>|<sub>J</sub>)<sub>p,q∈Q</sub> and Λ̄|<sub>J</sub> ≜ (L̄<sub>p,q</sub>|<sub>J</sub>)<sub>p,q∈Q</sub> for J ∈ {W, C, R, U<sub>c</sub>}. Therefore, by the Knaster–Tarski theorem, we obtain the following proposition.

**Proposition 2.** lfp f<sub>A</sub> = (Λ|<sub>W</sub>, Λ|<sub>C</sub>, Λ|<sub>R</sub>, Λ|<sub>U<sub>c</sub></sub>) and lfp r<sub>A</sub> = (Λ|<sub>W</sub>, Λ|<sub>C</sub>, Λ|<sub>R</sub>, Λ̄|<sub>W</sub>, Λ̄|<sub>C</sub>, Λ̄|<sub>R</sub>).

<sup>6</sup> A sequence {s<sub>n</sub>}<sub>n∈ℕ</sub> ∈ E<sup>ℕ</sup> on an ordered set (E, ⩽) is ascending if for every n ∈ ℕ we have s<sub>n</sub> ⩽ s<sub>n+1</sub>.

Finally, by Proposition 2, we obtain the desired fixpoint characterization of S:

$$S = \bigcup\_{p \in Q} \left( (\text{lfp } f\_{\mathcal{A}})\_{2, q\_I, p} \times (\text{lfp } r\_{\mathcal{A}})\_{5, p, p} \right) \cup \left( (\text{lfp } f\_{\mathcal{A}})\_{4, q\_I, p} \times (\text{lfp } r\_{\mathcal{A}})\_{6, p, p} \right) \tag{2}$$

Example 2. We derive from the VPA A depicted in Fig. 1 the following functions

$$\begin{aligned} W(X) & \triangleq \{\epsilon\} \cup cXr \cup XX, & \qquad C(X,Y) & \triangleq X \cup YY, \\ R(X,Z) & \triangleq \{c\} \cup X \cup ZZ, & \qquad U(Y,Z,T) & \triangleq \{c\} \cup YTZ \; . \end{aligned}$$

Hence, we obtain the function

$$\begin{aligned} f\_{\mathcal{A}} &\colon \wp(\Sigma^\*)^4 \longrightarrow \wp(\Sigma^\*)^4\\ (X, Y, Z, T) &\longmapsto (W(X), C(X, Y), R(X, Z), U(Y, Z, T)) \end{aligned}$$

The first three Kleene iterates of the fixpoint computation of lfp f<sub>A</sub> are given by

$$\begin{aligned} f\_{\mathcal{A}}(\vec{\varnothing}) &= (\{\epsilon\}, \emptyset, \{c\}, \{c\}), \\ f\_{\mathcal{A}}^2(\vec{\varnothing}) &= (\{\epsilon, cr\}, \{\epsilon\}, \{\epsilon, c, c^2\}, \{c\}), \\ f\_{\mathcal{A}}^3(\vec{\varnothing}) &= (\{\epsilon, cr, c^2r^2, (cr)^2\}, \{\epsilon, cr\}, \{\epsilon, cr, c, c^2, c^3, c^4\}, \{c, c^2, c^3\}), \\ &\ \ \vdots \\ \text{lfp } f\_{\mathcal{A}} &= (\mathsf{W}, \mathsf{W}, \mathsf{R}, \mathsf{R} \setminus \mathsf{C}) \end{aligned}$$

Since the unique state of A is a final state, we have that L<sub>q<sub>I</sub>,q<sub>I</sub></sub> = L̄<sub>q<sub>I</sub>,q<sub>I</sub></sub>. Consequently, the function f<sub>A</sub> suffices to describe both the set of prefixes and the set of periods of S, given by ((lfp f<sub>A</sub>)<sub>2</sub> × (lfp f<sub>A</sub>)<sub>2</sub>\{ε}) ∪ ((lfp f<sub>A</sub>)<sub>4</sub> × (lfp f<sub>A</sub>)<sub>3</sub>\{ε}).

Each (i, p, q)-component of the Kleene iterates of f<sub>A</sub> and r<sub>A</sub> holds a finite set of words. However, if the language L<sup>ω</sup>(A) is infinite, the fixpoint computations of lfp f<sub>A</sub> and lfp r<sub>A</sub> do not terminate in a finite number of steps. Nevertheless, under some monotonicity assumptions on our wqos, we show in the following section that we can compute a finite basis for S w.r.t. ⩽×≼ through a terminating fixpoint computation.

## **5 Monotonicity Requirements**

In order to detect finite bases among the Kleene iterates of the functions defined in the previous section, we replace the set inclusion on ℘(Σ<sup>*</sup>), used so far, with the qo ⊑<sub>⩽</sub> ⊆ ℘(Σ<sup>*</sup>) × ℘(Σ<sup>*</sup>) defined by X ⊑<sub>⩽</sub> Y ⟺ ∀x ∈ X, ∃y ∈ Y, y ⩽ x. The qo ⊑<sub>⩽</sub> leverages the notion of basis: given X ∈ ℘(Σ<sup>*</sup>), a subset Y ⊆ X is a basis for X with respect to ⩽ whenever X ⊑<sub>⩽</sub> Y.

In the following we lift the notion of basis to n-dimensional vectors componentwise and work with the quasiordered sets (℘(Σ<sup>*</sup>)<sup>n·|Q|²</sup>, ⊑<sub>⩽</sub><sup>n·|Q|²</sup>), where n ∈ {4, 6} and the ordering ⊑<sub>⩽</sub><sup>n·|Q|²</sup> is given by the product ⊑<sub>⩽</sub>×···×⊑<sub>⩽</sub> of n·|Q|² factors. Given a pair ⩽, ≼ of wqos, the orderings ⊑<sub>⩽</sub><sup>4·|Q|²</sup> and ⊑<sub>≼</sub><sup>6·|Q|²</sup> are used to compare the Kleene iterates of the functions f<sub>A</sub> and r<sub>A</sub> respectively. For them to be apt to detect finite bases for the least fixpoints of these functions, the qos ⩽ and ≼ need to verify some monotonicity conditions.
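The basis and subsumption checks just defined can be illustrated with a minimal Python sketch over an arbitrary decidable qo `leq`; the divisibility ordering used at the end is a toy assumption, not one of the quasiorders of the paper.

```python
# Finite basis of a finite set X w.r.t. a decidable qo `leq`:
# keep x only if no already-kept y satisfies y <= x, and drop kept
# elements that x itself covers. The result is an antichain.
def basis(X, leq):
    B = []
    for x in X:
        if any(leq(y, x) for y in B):
            continue                                  # x is already covered
        B = [y for y in B if not leq(x, y)] + [x]     # x covers some of B
    return B

# Subsumption check X ⊑ Y: every x in X is covered by some y in Y.
def subsumed(X, Y, leq):
    return all(any(leq(y, x) for y in Y) for x in X)

# toy qo: divisibility on positive integers
leq = lambda a, b: b % a == 0
print(sorted(basis({4, 5, 6, 8, 12}, leq)))   # -> [4, 5, 6]
print(subsumed({8, 12}, {4}, leq))            # -> True
```

A wqo guarantees that every set has a finite basis of this kind, which is what makes the subsumption check between successive Kleene iterates eventually succeed.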

We introduce the monotonicity conditions **W**, **C**, **R**, **U**, **C̄** and **R̄** on a qo ⩽ ⊆ Σ<sup>*</sup> × Σ<sup>*</sup> as follows: for all u, u′ ∈ Σ<sup>*</sup> such that u ⩽ u′,

> (**W**) if u, u′ ∈ W and c ∈ Σ<sub>c</sub>, r ∈ Σ<sub>r</sub> then cur ⩽ cu′r,
> (**C**) if u, u′ ∈ C and s ∈ C, t ∈ Σ<sup>*</sup> then sut ⩽ su′t,
> (**R**) if u, u′ ∈ R and s ∈ Σ<sup>*</sup>, t ∈ R then sut ⩽ su′t,
> (**U**) if u, u′ ∈ U<sub>c</sub> and s ∈ C, t ∈ R then sut ⩽ su′t,
> (**C̄**) if u, u′ ∈ C and s ∈ C, t ∈ C then sut ⩽ su′t,
> (**R̄**) if u, u′ ∈ R and s ∈ R, t ∈ R then sut ⩽ su′t.

A pair of qos ⩽, ≼ is monotonic if ⩽ verifies **W**, **C**, **R**, **U** and ≼ verifies **W**, **C̄**, **R̄**.

**Proposition 3.** Let ⩽, ≼ be a pair of wqos. There is a positive integer n such that f<sub>A</sub><sup>n+1</sup>(∅⃗) ⊑<sub>⩽</sub><sup>4·|Q|²</sup> f<sub>A</sub><sup>n</sup>(∅⃗) (resp. r<sub>A</sub><sup>n+1</sup>(∅⃗) ⊑<sub>≼</sub><sup>6·|Q|²</sup> r<sub>A</sub><sup>n</sup>(∅⃗)); and, if the pair of wqos is monotonic, then lfp f<sub>A</sub> ⊑<sub>⩽</sub><sup>4·|Q|²</sup> f<sub>A</sub><sup>n</sup>(∅⃗) (resp. lfp r<sub>A</sub> ⊑<sub>≼</sub><sup>6·|Q|²</sup> r<sub>A</sub><sup>n</sup>(∅⃗)).

Each Kleene iterate of f<sub>A</sub> and r<sub>A</sub> is computable and, given a decidable qo ⩽ on Σ<sup>*</sup> and two finite sets X, Y ⊆ Σ<sup>*</sup>, it is decidable whether X ⊑<sub>⩽</sub> Y holds. Thus, given a monotonic pair ⩽, ≼ of decidable wqos, by Proposition 3 we can compute a finite basis for lfp f<sub>A</sub> w.r.t. ⩽ and a finite basis for lfp r<sub>A</sub> w.r.t. ≼. Hence, by Equation (2), we can compute a finite basis for S w.r.t. ⩽×≼.

## **6 Quasiorders for** *ω***-VPL**

In the following we present two families of qos to solve the inclusion problem L<sup>ω</sup>(A) ⊆ M: the state-based qos, which are derived from a VPA-representation of M and compare words according to the set of configurations each word connects in the VPA, and the syntactic qos, which rely on the syntactic structure of M. We say that a pair of qos is M-suitable if it is an M-preserving and monotonic pair of decidable wqos. Intuitively, if a pair of qos is M-suitable then it can be used in our algorithm to decide the inclusion L<sup>ω</sup>(A) ⊆ M.

**State-based Quasiorders.** Given a VPA B = (Q̂, q̂<sub>I</sub>, Γ̂, δ̂, F̂), we associate with each word u ∈ Σ<sup>*</sup> its context ctx<sup>B</sup>[u] and final context ctx<sup>B</sup><sub>F</sub>[u] in B as follows:

$$\begin{aligned} \text{ctx}^{\mathcal{B}}[u] & \triangleq \{ (p,q) \in \hat{Q}^2 \mid \exists w \in \hat{\Gamma}^\*, (p,\bot) \vdash^{\ast u} (q,w) \}, \\ \text{ctx}^{\mathcal{B}}\_{F}[u] & \triangleq \{ (p,q) \in \hat{Q}^2 \mid \exists w \in \hat{\Gamma}^{\ast}, (p,\bot) \vdash^{\ast u} (q,w) \text{ visiting a state of } \hat{F} \} \ . \end{aligned}$$

Hence we define the following qos on words in Σ∗:

$$u \leqslant^{\mathcal{B}} u' \stackrel{\Delta}{\iff} \text{ctx}^{\mathcal{B}}[u] \subseteq \text{ctx}^{\mathcal{B}}[u'], \qquad u \preccurlyeq^{\mathcal{B}} u' \stackrel{\Delta}{\iff} u \leqslant^{\mathcal{B}} u' \land \text{ctx}^{\mathcal{B}}\_{F}[u] \subseteq \text{ctx}^{\mathcal{B}}\_{F}[u'] \ .$$

**Proposition 4.** Let B be a VPA. The pair ⩽<sup>B</sup>, ≼<sup>B</sup> is L<sup>ω</sup>(B)-suitable.

Example 3. Consider the pair of qos ⩽<sup>B</sup>, ≼<sup>B</sup> derived as explained above from B (Fig. 1) and the set S = (W × W\{ε}) ∪ (R\C × R\{ε}) from Example 1. We have that ctx<sup>B</sup>[ε] = {(p, p), (q, q)}, ctx<sup>B</sup><sub>F</sub>[ε] = {(p, p)}, ctx<sup>B</sup>[u] = {(p, q), (q, q)} and ctx<sup>B</sup><sub>F</sub>[u] = {(p, q)} for every u ∈ R\{ε}. Therefore {c} is a basis for R\{ε} w.r.t. ≼<sup>B</sup>, since c ≼<sup>B</sup> u for every u ∈ R\{ε}. Since R\C ⊆ R\{ε} and {c} ⊆ R\C, we deduce that {c} is also a basis for R\C w.r.t. ⩽<sup>B</sup>. Similarly, we deduce that {ε, cr} is a basis for W w.r.t. ⩽<sup>B</sup> and that {cr} is a basis for W\{ε} w.r.t. ≼<sup>B</sup>. Hence, ({ε, cr}×{cr}) ∪ ({c}×{c}) is a basis for S w.r.t. ⩽<sup>B</sup>×≼<sup>B</sup>.

**Syntactic Quasiorders.** Given an ω-VPL M, we associate with each word u ∈ Σ<sup>*</sup> its context ctx<sup>M</sup>[u] and final context ctx<sup>M</sup><sub>F</sub>[u] in M as follows:

$$\begin{aligned} \text{ctx}^M[u] & \triangleq \{(s,\xi) \in \Sigma^\* \times \Sigma^\omega \mid su\xi \in M\}, \\ \text{ctx}^M\_{F}[u] & \triangleq \{(s,t) \in \Sigma^\* \times \Sigma^\* \mid s(ut)^\omega \in M\} \ . \end{aligned}$$

At first glance, we are tempted to define the syntactic qos from ctx<sup>M</sup> and ctx<sup>M</sup><sub>F</sub> in a way analogous to how we defined the state-based qos from the contexts and final contexts relative to a VPA. Although this definition provides a pair of M-preserving qos, it does not guarantee that the pair is M-suitable. To overcome this, we impose the respect of the partition P ≜ {W, C\W, R\W, U<sub>c</sub>\R} of Σ<sup>*</sup>, meaning that two words compare only if they belong to the same block of P. Additionally, given J ∈ P we compare two words of J by considering a restriction of their context and final context in M which depends on J. More precisely, we define the qo ⩽<sup>M</sup> on Σ<sup>*</sup> as the union ⋃<sub>J∈P</sub> ⩽<sup>M</sup><sub>J</sub> where for every J ∈ P, the qo ⩽<sup>M</sup><sub>J</sub> ⊆ J × J is defined by

$$\begin{split} &u \leqslant\_{\mathsf{W}}^{M} u' \stackrel{\Delta}{\iff} \operatorname{ctx}^{M}[u] \subseteq \operatorname{ctx}^{M}[u'], \\ &u \leqslant\_{\mathsf{C} \setminus \mathsf{W}}^{M} u' \stackrel{\Delta}{\iff} \operatorname{ctx}^{M}[u]|\_{\mathsf{C}\times\Sigma^{\omega}} \subseteq \operatorname{ctx}^{M}[u']|\_{\mathsf{C}\times\Sigma^{\omega}}, \\ &u \leqslant\_{\mathsf{R} \setminus \mathsf{W}}^{M} u' \stackrel{\Delta}{\iff} \operatorname{ctx}^{M}[u]|\_{\Sigma^{\*}\times\mathsf{R}^{\omega}} \subseteq \operatorname{ctx}^{M}[u']|\_{\Sigma^{\*}\times\mathsf{R}^{\omega}}, \\ &u \leqslant\_{\mathsf{U}\_{c} \setminus \mathsf{R}}^{M} u' \stackrel{\Delta}{\iff} \operatorname{ctx}^{M}[u]|\_{\mathsf{C}\times\mathsf{R}^{\omega}} \subseteq \operatorname{ctx}^{M}[u']|\_{\mathsf{C}\times\mathsf{R}^{\omega}}. \end{split}$$

Similarly, we define the qo ≼<sup>M</sup> ≜ ⋃<sub>J∈P</sub> ≼<sup>M</sup><sub>J</sub> on Σ<sup>*</sup> where for every J ∈ P, ≼<sup>M</sup><sub>J</sub> ⊆ J × J is the qo defined by

$$\begin{split} &u \preccurlyeq\_{\mathsf{W}}^{M} u' \stackrel{\Delta}{\iff} u \leqslant\_{\mathsf{W}}^{M} u' \ \wedge \ \operatorname{ctx}\_{F}^{M}[u] \subseteq \operatorname{ctx}\_{F}^{M}[u'],\\ &u \preccurlyeq\_{\mathsf{C} \setminus \mathsf{W}}^{M} u' \stackrel{\Delta}{\iff} u \leqslant\_{\mathsf{C} \setminus \mathsf{W}}^{M} u' \ \wedge \ \left(\operatorname{ctx}\_{F}^{M}[u]|\_{\mathsf{C}\times\mathsf{C}} \subseteq \operatorname{ctx}\_{F}^{M}[u']|\_{\mathsf{C}\times\mathsf{C}}\right),\\ &u \preccurlyeq\_{\mathsf{R} \setminus \mathsf{W}}^{M} u' \stackrel{\Delta}{\iff} u \leqslant\_{\mathsf{R} \setminus \mathsf{W}}^{M} u' \ \wedge \ \left(\operatorname{ctx}\_{F}^{M}[u]|\_{\Sigma^{\star}\times\mathsf{R}} \subseteq \operatorname{ctx}\_{F}^{M}[u']|\_{\Sigma^{\star}\times\mathsf{R}}\right),\\ &u \preccurlyeq\_{\mathsf{U}\_{c} \setminus \mathsf{R}}^{M} u' \stackrel{\Delta}{\iff} u, u' \in \mathsf{U}\_{c} \setminus \mathsf{R} \ .\end{split}$$

**Proposition 5.** Let B be a VPA. The pair ⩽<sup>L<sup>ω</sup>(B)</sup>, ≼<sup>L<sup>ω</sup>(B)</sup> is L<sup>ω</sup>(B)-suitable.

Proof (sketch). First we show that the pair ⩽<sup>M</sup>, ≼<sup>M</sup> is M-preserving, where M ≜ L<sup>ω</sup>(B). Let (u, v), (u′, v′) ∈ C × C (resp. U<sub>c</sub> × R) be such that u ⩽<sup>M</sup> u′, v ≼<sup>M</sup> v′ and uv<sup>ω</sup> ∈ M. From u ⩽<sup>M</sup> u′ and uv<sup>ω</sup> ∈ M we deduce that (ε, v<sup>ω</sup>) ∈ ctx<sup>M</sup>[u]|<sub>C×Σ<sup>ω</sup></sub> ⊆ ctx<sup>M</sup>[u′]|<sub>C×Σ<sup>ω</sup></sub> (resp. (ε, v<sup>ω</sup>) ∈ ctx<sup>M</sup>[u]|<sub>C×R<sup>ω</sup></sub> ⊆ ctx<sup>M</sup>[u′]|<sub>C×R<sup>ω</sup></sub>). Thus, u′v<sup>ω</sup> ∈ M. From v ≼<sup>M</sup> v′ and u′v<sup>ω</sup> ∈ M we deduce that (u′, ε) ∈ ctx<sup>M</sup><sub>F</sub>[v]|<sub>C×C</sub> ⊆ ctx<sup>M</sup><sub>F</sub>[v′]|<sub>C×C</sub> (resp. (u′, ε) ∈ ctx<sup>M</sup><sub>F</sub>[v]|<sub>Σ<sup>*</sup>×R</sub> ⊆ ctx<sup>M</sup><sub>F</sub>[v′]|<sub>Σ<sup>*</sup>×R</sub>). Thus, u′v′<sup>ω</sup> ∈ M.

We now show that the qo ⩽<sup>M</sup> satisfies the monotonicity conditions **C** and **R**. Let u ⩽<sup>M</sup> u′ with u, u′ ∈ C (resp. u, u′ ∈ R). Let s ∈ C and t ∈ Σ<sup>*</sup> (resp. s ∈ Σ<sup>*</sup> and t ∈ R). If u, u′ ∈ W then it is easy to check that sut ⩽<sup>M</sup> su′t. Otherwise u, u′ ∈ C\W (resp. u, u′ ∈ R\W) and we distinguish two cases: if t ∈ C (resp. s ∈ R) then sut, su′t ∈ C\W (resp. sut, su′t ∈ R\W). We show that sut ⩽<sup>M</sup><sub>C\W</sub> su′t (resp. sut ⩽<sup>M</sup><sub>R\W</sub> su′t). Let (s′, ξ) ∈ ctx<sup>M</sup>[sut]|<sub>C×Σ<sup>ω</sup></sub> (resp. (s′, ξ) ∈ ctx<sup>M</sup>[sut]|<sub>Σ<sup>*</sup>×R<sup>ω</sup></sub>). Since s′s ∈ C (resp. tξ ∈ R<sup>ω</sup>), we deduce from u ⩽<sup>M</sup><sub>C\W</sub> u′ (resp. u ⩽<sup>M</sup><sub>R\W</sub> u′) that (s′, ξ) ∈ ctx<sup>M</sup>[su′t]|<sub>C×Σ<sup>ω</sup></sub> (resp. (s′, ξ) ∈ ctx<sup>M</sup>[su′t]|<sub>Σ<sup>*</sup>×R<sup>ω</sup></sub>). If t ∈ U<sub>c</sub> (resp. s ∈ Σ<sup>*</sup>\R) then sut, su′t ∈ U<sub>c</sub>\R and similarly we can show that sut ⩽<sup>M</sup><sub>U<sub>c</sub>\R</sub> su′t. The proof that ⩽<sup>M</sup> and ≼<sup>M</sup> are wqos follows from [9, Prop. 1.2] by observing that for every J in the partition P of Σ<sup>*</sup> we have ⩽<sup>B</sup>|<sub>J×J</sub> ⊆ ⩽<sup>M</sup><sub>J</sub> and ≼<sup>B</sup>|<sub>J×J</sub> ⊆ ≼<sup>M</sup><sub>J</sub>, where ⩽<sup>B</sup> and ≼<sup>B</sup> are the state-based qos previously defined.

Deciding the syntactic qos can easily be shown to be as hard as the inclusion problem between ω-VPLs generated by VPA. Nevertheless, the syntactic qos act as a gold standard for quasiorders, in the sense formalized in the next proposition.

**Proposition 6.** Let M ⊆ Σ<sup>ω</sup> be an ω-VPL and ⩽, ≼ be an M-suitable pair of qos such that ≼ ⊆ ⩽. For every J ∈ P we have ⩽|<sub>J×J</sub> ⊆ ⩽<sup>M</sup><sub>J</sub> and ≼|<sub>J×J</sub> ⊆ ≼<sup>M</sup><sub>J</sub>.

By Propositions 5 and 6, the pair ⩽<sup>L<sup>ω</sup>(B)</sup>, ≼<sup>L<sup>ω</sup>(B)</sup> is the greatest (w.r.t. ⊆×⊆) among the L<sup>ω</sup>(B)-suitable pairs ⩽, ≼ of qos that respect the partition P and that verify ≼ ⊆ ⩽.

## **7 Algorithm**

We are now in a position to present our algorithm which, given two VPA A = (Q, q<sub>I</sub>, Γ, δ, F) and B = (Q̂, q̂<sub>I</sub>, Γ̂, δ̂, F̂) and a pair of L<sup>ω</sup>(B)-suitable qos, decides the inclusion problem L<sup>ω</sup>(A) ⊆ L<sup>ω</sup>(B).

Algorithm 1 computes a finite basis for S w.r.t. ⩽×≼ (lines 1–2) and afterwards checks membership in L<sup>ω</sup>(B) of every ultimately periodic word uv<sup>ω</sup> stemming from this finite basis (lines 3–7).

**Theorem 3.** Given the required inputs, Algorithm 1 decides the inclusion problem <sup>L</sup><sup>ω</sup>(A) <sup>⊆</sup> <sup>L</sup><sup>ω</sup>(B).

Proof. As established by Proposition 3, given a monotonic pair ⩽, ≼ of decidable wqos, Algorithm 1 computes in line 1 (resp. line 2) a finite basis f<sub>A</sub><sup>m</sup>(∅⃗) (resp.

**Algorithm 1:** Algorithm for deciding L<sup>ω</sup>(A) ⊆ L<sup>ω</sup>(B)

**Data:** VPA A = (Q, q<sub>I</sub>, Γ, δ, F) and B = (Q̂, q̂<sub>I</sub>, Γ̂, δ̂, F̂).
**Data:** L<sup>ω</sup>(B)-suitable pair ⩽, ≼.
**Data:** Procedure deciding uv<sup>ω</sup> ∈ L<sup>ω</sup>(B) given (u, v).

**1** Compute f<sub>A</sub><sup>m</sup>(∅⃗) with least m s.t. f<sub>A</sub><sup>m+1</sup>(∅⃗) ⊑<sub>⩽</sub><sup>4·|Q|²</sup> f<sub>A</sub><sup>m</sup>(∅⃗);
**2** Compute r<sub>A</sub><sup>m′</sup>(∅⃗) with least m′ s.t. r<sub>A</sub><sup>m′+1</sup>(∅⃗) ⊑<sub>≼</sub><sup>6·|Q|²</sup> r<sub>A</sub><sup>m′</sup>(∅⃗);
**3** **foreach** p ∈ Q **do**
**4** &nbsp;&nbsp;**foreach** u ∈ (f<sub>A</sub><sup>m</sup>(∅⃗))<sub>2,q<sub>I</sub>,p</sub>, v ∈ (r<sub>A</sub><sup>m′</sup>(∅⃗))<sub>5,p,p</sub> **do**
**5** &nbsp;&nbsp;&nbsp;&nbsp;**if** uv<sup>ω</sup> ∉ L<sup>ω</sup>(B) **then return** false;
**6** &nbsp;&nbsp;**foreach** u ∈ (f<sub>A</sub><sup>m</sup>(∅⃗))<sub>4,q<sub>I</sub>,p</sub>, v ∈ (r<sub>A</sub><sup>m′</sup>(∅⃗))<sub>6,p,p</sub> **do**
**7** &nbsp;&nbsp;&nbsp;&nbsp;**if** uv<sup>ω</sup> ∉ L<sup>ω</sup>(B) **then return** false;
**8** **return** true;

r<sub>A</sub><sup>m′</sup>(∅⃗)) for lfp f<sub>A</sub> (resp. lfp r<sub>A</sub>) w.r.t. ⩽ (resp. ≼). Next define:

$$S\_{\mathcal{A}}^{m,m'} \triangleq \bigcup\_{p \in Q} \left( (f\_{\mathcal{A}}^{m}(\vec{\emptyset}))\_{2,q\_I,p} \times (r\_{\mathcal{A}}^{m'}(\vec{\emptyset}))\_{5,p,p} \right) \cup \left( (f\_{\mathcal{A}}^{m}(\vec{\emptyset}))\_{4,q\_I,p} \times (r\_{\mathcal{A}}^{m'}(\vec{\emptyset}))\_{6,p,p} \right) \ .$$

Using Equation (2), we deduce that S<sub>A</sub><sup>m,m′</sup> is a finite basis for S w.r.t. ⩽×≼. Since the pair ⩽, ≼ is L<sup>ω</sup>(B)-preserving, by Section 3 we deduce that

$$L^{\omega}(\mathcal{A}) \subseteq L^{\omega}(\mathcal{B}) \iff \forall (u,v) \in S^{m,m'}\_{\mathcal{A}}, \; uv^{\omega} \in L^{\omega}(\mathcal{B})\;.$$

We remark that Algorithm 1 can be easily adapted to decide the inclusion problem between visibly pushdown languages of finite words. The adaptation to the finite words case omits the fixpoint computation of line 2 and iterates over the components (i, q<sup>I</sup> , p) where i ∈ {2, 3, 4} and where p ∈ F is a final state.
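The overall control flow of Algorithm 1 can be sketched in Python. All parameter names are assumptions standing in for the paper's ingredients: `step_f`/`step_r` apply f<sub>A</sub>/r<sub>A</sub> once (possibly restricted to bases), `subsumed` is the lifted-qo convergence check of lines 1–2, `pairs` enumerates the candidate decompositions of lines 3–7, and `member(u, v)` decides uv<sup>ω</sup> ∈ L<sup>ω</sup>(B).

```python
# Structural sketch of Algorithm 1: iterate both fixpoints to convergence,
# then test membership of every candidate ultimately periodic word.
def algorithm1(step_f, step_r, bottom_f, bottom_r, subsumed, pairs, member):
    pref = bottom_f
    while not subsumed(step_f(pref), pref):   # line 1: iterate f_A
        pref = step_f(pref)
    per = bottom_r
    while not subsumed(step_r(per), per):     # line 2: iterate r_A
        per = step_r(per)
    # lines 3-8: check uv^ω ∈ L^ω(B) for every candidate decomposition
    return all(member(u, v) for (u, v) in pairs(pref, per))

# toy instantiation in which both iterations converge after one step
print(algorithm1(lambda x: x | {"a"}, lambda x: x | {"b"}, set(), set(),
                 lambda new, old: new <= old,
                 lambda pref, per: [(u, v) for u in pref for v in per],
                 lambda u, v: True))  # -> True
```

The finite-words adaptation mentioned above would simply drop the second loop and draw the candidates from the final-state components of the first fixpoint.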

Example 4. Consider the iterates of the function f<sub>A</sub> from Example 2. One can check that f<sub>A</sub><sup>4</sup>(∅⃗) ⊑<sub>≼<sup>B</sup></sub><sup>4</sup> f<sub>A</sub><sup>3</sup>(∅⃗) (thus also f<sub>A</sub><sup>4</sup>(∅⃗) ⊑<sub>⩽<sup>B</sup></sub><sup>4</sup> f<sub>A</sub><sup>3</sup>(∅⃗), since ≼<sup>B</sup> ⊆ ⩽<sup>B</sup>). Thus, we check whether the inclusion L<sup>ω</sup>(A) ⊆ L<sup>ω</sup>(B) holds on the finite set ({ε, cr}×{cr}) ∪ ({c, c², c³}×{cr, c, c², c³, c⁴}) and find the counterexample c(cr)<sup>ω</sup> ∈ L<sup>ω</sup>(A)\L<sup>ω</sup>(B).

**Antichains Everywhere.** We show next that Algorithm 1 remains correct if, in the sequence of Kleene iterates of f<sub>A</sub> or r<sub>A</sub>, we first select a finite basis of the argument before each application of f<sub>A</sub> or r<sub>A</sub> (using ⊑<sub>⩽</sub><sup>4·|Q|²</sup> for f<sub>A</sub> and ⊑<sub>≼</sub><sup>6·|Q|²</sup> for r<sub>A</sub>).

**Proposition 7.** Let ⩽ be a qo that verifies the monotonicity conditions **W**, **C**, **R**, **U**. If B is a basis for (X, Y, Z, T) ∈ ℘(W)<sup>|Q|²</sup> × ℘(C)<sup>|Q|²</sup> × ℘(R)<sup>|Q|²</sup> × ℘(U<sub>c</sub>)<sup>|Q|²</sup> w.r.t. ⊑<sub>⩽</sub><sup>4·|Q|²</sup>, then f<sub>A</sub>(B) is a basis for f<sub>A</sub>(X, Y, Z, T) w.r.t. ⊑<sub>⩽</sub><sup>4·|Q|²</sup>. The analogous result holds for r<sub>A</sub> when ≼ satisfies the monotonicity conditions **W**, **C̄**, **R̄**.

Since every Kleene iterate of f<sub>A</sub> belongs to ℘(W)<sup>|Q|²</sup> × ℘(C)<sup>|Q|²</sup> × ℘(R)<sup>|Q|²</sup> × ℘(U<sub>c</sub>)<sup>|Q|²</sup>, given a basis B for f<sub>A</sub><sup>n</sup>(∅⃗) w.r.t. ⊑<sub>⩽</sub><sup>4·|Q|²</sup>, by Proposition 7, f<sub>A</sub>(B) is a basis for f<sub>A</sub><sup>n+1</sup>(∅⃗) w.r.t. ⊑<sub>⩽</sub><sup>4·|Q|²</sup>. Hence, at each iteration we can select, for each (i, p, q)-component, a basis w.r.t. ⩽ and then apply f<sub>A</sub>. In particular, we can keep antichains for each (i, p, q)-component, that is, finite bases of pairwise incomparable words. The analogous result holds for the Kleene iterates of r<sub>A</sub>.

## **7.1 State-based Algorithm**

Next we consider Algorithm 1 instantiated with the pair of state-based qos (§ 6).

**Data Structures.** Comparing two words in a state-based qo requires computing the corresponding sets of contexts in B. Instead of computing contexts every time we need to compare two words, we cache the context information along with each word for faster retrieval. More precisely, we cache ctx<sup>B</sup>[u] along with u when u is a prefix, and we cache (ctx<sup>B</sup>[v], ctx<sup>B</sup><sub>F</sub>[v]) along with v when v is a period. Next we go even further and explain how new context information can be computed inductively from already computed context information. Assume we are computing a new word during the fixpoint computation, for instance the word cur obtained by flanking u with c and r. We will show that the context information of cur can be computed directly from that of u, c and r instead of computing it from scratch.

**Fixpoint Computation.** Given an input vector, the functions f<sub>A</sub> and r<sub>A</sub> add new words of the forms uu′ and cur to its components, where c and r are fixed letters and u, u′ are words already present in some components of the vector. The following equalities show that we can inductively compute the contexts and final contexts in B of the newly added words: for every u, u′ ∈ C ∪ R, c ∈ Σ<sub>c</sub>, r ∈ Σ<sub>r</sub>, we have

$$\begin{aligned} \text{ctx}^{\mathcal{B}}[uu'] &= \{ (p,q) \in \hat{Q}^2 \mid \exists q' \in \hat{Q}, (p,q') \in \text{ctx}^{\mathcal{B}}[u], (q',q) \in \text{ctx}^{\mathcal{B}}[u'] \}, \\ \text{ctx}^{\mathcal{B}}[cur] &= \{ (p,q) \in \hat{Q}^2 \mid \exists (p',q') \in \text{ctx}^{\mathcal{B}}[u], \exists \gamma \in \hat{\Gamma}, (p,c,p',\gamma) \in \hat{\delta}\_c, (q',r,\gamma,q) \in \hat{\delta}\_r \} \ . \end{aligned}$$

The analogous definitions for ctx<sup>B</sup><sub>F</sub>[uu′] and ctx<sup>B</sup><sub>F</sub>[cur] are left as an exercise to the reader.
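The inductive context computation can be sketched in Python with contexts represented as sets of state pairs. The tuple encoding of transitions below, and the δ̂ sets (read off from Example 5), are assumptions made for the illustration, not the paper's data structures.

```python
# ctx[u u']: relational composition of the two cached contexts
def ctx_concat(ctx_u, ctx_v):
    return {(p, q) for (p, s) in ctx_u for (t, q) in ctx_v if s == t}

# ctx[c u r]: push gamma on the call c, run u, pop gamma on the return r
def ctx_wrap(c, ctx_u, r, delta_c, delta_r):
    return {(p, q)
            for (p1, q1) in ctx_u
            for (p, a, p2, g1) in delta_c if a == c and p2 == p1
            for (q2, b, g2, q) in delta_r if b == r and q2 == q1 and g1 == g2}

# transitions of B as read from Example 5 (states p, q; stack symbol A)
delta_c = {("p", "c", "q", "A"), ("q", "c", "q", "A")}
delta_r = {("q", "r", "A", "q")}
ctx_eps = {("p", "p"), ("q", "q")}            # ctx[ε] from Example 3
print(sorted(ctx_wrap("c", ctx_eps, "r", delta_c, delta_r)))
# -> [('p', 'q'), ('q', 'q')]
```

Running this reproduces the set {(p, q), (q, q)} computed for ctx<sup>B</sup>[cr] in Example 5.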

Example 5. Using the above definitions, it is routine to check that ctx<sup>B</sup>[cr] = {(p, q), (q, q)}, because cr = cεr, ctx<sup>B</sup>[ε] = {(p, p), (q, q)} (Example 3) and (p, c, q, A), (q, c, q, A) ∈ δ̂<sub>c</sub>, (q, r, A, q) ∈ δ̂<sub>r</sub>.

Using the context information cached along with the words, we check convergence of the fixpoint computations (lines 1–2) with qos defined directly on contexts: the qo induced by ⊆ on ℘(Q̂²) for prefixes, and the qo induced by ⊆×⊆ on ℘(Q̂²) × ℘(Q̂²) for periods.

Incidentally, as we show below, we can perform the membership checks of lines 5 and 7 (asking whether uv<sup>ω</sup> ∈ L<sup>ω</sup>(B), given u and v) using only the context information associated with the prefix u and the period v.

**Membership Check.** To decide membership in L<sup>ω</sup>(B) we use the membership predicate Inc<sup>B</sup> defined for x, y<sub>1</sub>, y<sub>2</sub> ∈ ℘(Q̂²) as follows:

$$\operatorname{Inc}^{\mathcal{B}}(x, y\_1, y\_2) \triangleq \exists q, p \in \hat{Q}, (\hat{q}\_I, q) \in x \land (q, p) \in y\_1^\* \land (p, p) \in y\_1^\* \circ y\_2 \circ y\_1^\* \ ,$$

where, given two binary relations y, y′ ∈ ℘(Q̂²) on the states of B, the notation y ∘ y′ denotes their composition, and y<sup>*</sup> denotes the Kleene closure (i.e. reflexive-transitive closure) of y.
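The predicate Inc<sup>B</sup> and the relational operations it uses can be sketched directly on sets of state pairs. The relations at the end are toy data, an assumption made for the illustration rather than values taken from the paper's running example.

```python
# y1 ∘ y2: relational composition (apply y1, then y2)
def compose(y1, y2):
    return {(p, q) for (p, s) in y1 for (t, q) in y2 if s == t}

# y*: Kleene (reflexive-transitive) closure over a finite state set
def star(y, states):
    closure = {(p, p) for p in states} | set(y)
    while True:
        nxt = closure | compose(closure, closure)
        if nxt == closure:
            return closure
        closure = nxt

# Inc(x, y1, y2): ∃ q, p with (q_init, q) ∈ x, (q, p) ∈ y1*,
# and (p, p) ∈ y1* ∘ y2 ∘ y1*
def inc(x, y1, y2, q_init, states):
    ystar = star(y1, states)
    loop = compose(compose(ystar, y2), ystar)
    return any((q_init, q) in x and (q, p) in ystar and (p, p) in loop
               for q in states for p in states)

# toy relations over two states
states = {"p", "q"}
x = {("p", "q"), ("q", "q")}    # plays the role of ctx[u]
y1 = {("p", "q"), ("q", "q")}   # plays the role of ctx[v]
y2 = {("p", "q"), ("q", "q")}   # plays the role of ctx_F[v]
print(inc(x, y1, y2, "p", states))  # -> True
```

The check succeeds here because the toy y2 contains a final-visiting loop (q, q) reachable from the initial state through x and y1*.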

**Proposition 8.** For all (u, v) ∈ Ld, Inc<sup>B</sup>(ctx<sup>B</sup>[u], ctx<sup>B</sup>[v], ctx<sup>B</sup><sub>F</sub>[v]) ⟺ uv<sup>ω</sup> ∈ L<sup>ω</sup>(B).

Proof. Let (u, v) ∈ Ld. Note that if v ∈ C (resp. v ∈ R) then for every positive integer n we have v<sup>n</sup> ∈ C (resp. v<sup>n</sup> ∈ R) and (p, q) ∈ ctx<sup>B</sup>[v]<sup>*</sup> ⟺ ∃n, (p, q) ∈ ctx<sup>B</sup>[v<sup>n</sup>]. Therefore, if Inc<sup>B</sup>(ctx<sup>B</sup>[u], ctx<sup>B</sup>[v], ctx<sup>B</sup><sub>F</sub>[v]) holds then there are q, p ∈ Q̂ and two positive integers n, m such that (q̂<sub>I</sub>, q) ∈ ctx<sup>B</sup>[u], (q, p) ∈ ctx<sup>B</sup>[v<sup>n</sup>] and (p, p) ∈ ctx<sup>B</sup><sub>F</sub>[v<sup>m</sup>]. If (u, v) ∈ C × C then we deduce an accepting trace of B on uv<sup>ω</sup> of the form (q̂<sub>I</sub>, ⊥) ⊢<sup>*u</sup> (q, ⊥) ⊢<sup>*v<sup>n</sup></sup> (p, ⊥) ⊢<sup>*v<sup>m</sup></sup> (p, ⊥) ⊢ ···. If (u, v) ∈ U<sub>c</sub> × R then we deduce an accepting trace of B on uv<sup>ω</sup> of the form (q̂<sub>I</sub>, ⊥) ⊢<sup>*u</sup> (q, w) ⊢<sup>*v<sup>n</sup></sup> (p, ww′) ⊢<sup>*v<sup>m</sup></sup> (p, ww′w″) ⊢ ··· for some w, w′, w″ ∈ Γ̂<sup>*</sup>.

Conversely if uv<sup>ω</sup> <sup>∈</sup> <sup>L</sup><sup>ω</sup>(B) then there is an accepting trace of <sup>B</sup> on uv<sup>ω</sup>.

**–** If (u, v) ∈ C × C then this trace is of the form

$$(\hat{q}\_I, \bot) \vdash^{\ast u} (q, \bot) \vdash^{\ast v} (q\_1, \bot) \vdash^{\ast v} (q\_2, \bot) \vdash^{\ast v} \cdots$$

Since <sup>Q</sup><sup>ˆ</sup> is finite, there is <sup>p</sup> <sup>∈</sup> <sup>Q</sup><sup>ˆ</sup> and a sequence {nk}<sup>k</sup>∈<sup>N</sup> such that <sup>q</sup><sup>n</sup><sup>k</sup> <sup>=</sup> <sup>p</sup> for all <sup>k</sup> <sup>∈</sup> <sup>N</sup>. Since the trace is accepting there is <sup>m</sup> <sup>∈</sup> <sup>N</sup> such that (p, ⊥) v<sup>m</sup> (p, ⊥).

**–** If (u, v) ∈ U<sup>c</sup> × R then it is of the form

$$(\hat{q}\_I, \bot) \vdash^{\ast u} (q, w\_0) \vdash^{\ast v} (q\_1, w\_0 w\_1) \vdash^{\ast v} (q\_2, w\_0 w\_1 w\_2) \vdash^{\ast v} \cdots$$

where for each j ∈ ℕ no symbol of w_j is popped while reading v in the sequence of transitions (q_j, w_j) ⊢^{∗v} (q_{j+1}, w_j w_{j+1}). Thus, we can derive sequences (q_j, ⊥) ⊢^{∗v} (q_{j+1}, w_{j+1}) for every j ∈ ℕ. There is p ∈ Q̂ and a sequence {n_k}_{k∈ℕ} such that q_{n_k} = p for all k ∈ ℕ, and since the trace is accepting there is m ∈ ℕ such that (p, ⊥) ⊢^{∗vᵐ} (p, w_{n_j} ⋯ w_{n_{j+m}}).

In both cases we deduce that (q̂_I, q) ∈ ctx_B[u], (q, p) ∈ ctx_B[v^{n₀}] and (p, p) ∈ ctx_B⁻[vᵐ]. Thus, Inc_B(ctx_B[u], ctx_B[v], ctx_B⁻[v]) holds.

By showing how to reason on contexts directly (for comparisons, for applying the functions f_A and r_A, for the convergence check and for the membership check), we removed the need to store words altogether, since their contexts suffice. To sum up, Algorithm 1 instantiated with the state-based qos can be implemented by directly manipulating subsets of ℘(Q̂²) (for the prefixes) and pairs of subsets of ℘(Q̂²) (for the periods), thereby removing the need to store and manipulate words. We call this implementation of Algorithm 1 the state-based algorithm. We conclude this section with its complexity.
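As an illustration of the underlying operation, contexts in ℘(Q̂²) compose relationally: the context of a concatenation is the composition of the contexts of its parts. The following is a minimal Python sketch of this composition; the names are ours, not the paper's implementation.

```python
# Hedged sketch: contexts as frozensets of state pairs, composed relationally.
# compose(ctx_u, ctx_v) approximates ctx[uv] from ctx[u] and ctx[v].

def compose(ctx_u, ctx_v):
    """Relational composition: (p, r) whenever (p, q) in ctx_u and (q, r) in ctx_v."""
    return frozenset((p, r) for (p, q) in ctx_u for (q2, r) in ctx_v if q == q2)

# Example over states {0, 1, 2}: reading u moves 0 -> 1, reading v moves 1 -> 2.
ctx_u = frozenset({(0, 1)})
ctx_v = frozenset({(1, 2)})
assert compose(ctx_u, ctx_v) == frozenset({(0, 2)})
```

Iterating this composition is how contexts of powers vⁿ are obtained without ever materializing the words themselves.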

**Proposition 9.** Let n = |Q|, n̂ = |Q̂| and m = max{1, |Σ|}. The running time of the state-based algorithm is 2^{O(n̂²)} · m² · n⁴.

## **8 Experiments**

**Fig. 2.** Scatter plot comparing the runtimes (in seconds) of Ultimate and omegaVPLinc on the Ultimate suite. Both axes use a logarithmic scale. When a tool does not return an answer within 1800 seconds (it runs out of time or memory), the data point is plotted on the corresponding edge (top edge for Ultimate, right edge for omegaVPLinc).

We implemented omegaVPLinc [11], a Java prototype of the state-based algorithm, and evaluated it against Ultimate by Heizmann et al. [21], which decides inclusion via complementation, intersection and emptiness checks.<sup>7</sup>

Benchmarks. Our experiments use two sets of benchmarks. The first stems from [18] and consists of 5 queries L_ω(A) ⊆ L_ω(B), given A and B. We first translated those VPA into the AutomataScript language that Ultimate and omegaVPLinc can use, and then minimized them with Ultimate. The second set of benchmarks consists of 281 instances of VPA A, B₁, B₂, ..., B_n for which we run the query L_ω(A) ⊆ ⋃ᵢ₌₁ⁿ L_ω(Bᵢ). These VPA were computed by Ultimate from randomly selected tasks in the termination category of SV-COMP (the Software Verification Competition). We used Ultimate to compute the union of B₁, ..., B_n and then minimize the result before running each query.

<sup>7</sup> We excluded FADecider [18] from our evaluation because it returned 22 false positive answers on a randomly chosen subset of 50 of our 286 benchmarks. Counterexamples to inclusion for these benchmarks were validated with Ultimate. The problem has been reported.

Experimental Setup. We ran our experiments on Debian GNU/Linux 11 (Bullseye) 64-bit, on a server with 20 GB of RAM and 2 Xeon E5640 2.6 GHz CPUs. We used Ultimate version 0.2.1 with OpenJDK 11.0.13, whereas omegaVPLinc uses OpenJDK 17.0.1. The maximal heap size for both programs was set to 6 GB, and they were given a timeout of 30 minutes (1800 seconds).

Results. Of the 5 benchmarks in the FADecider suite, omegaVPLinc is faster on 4. Our prototype times out on the remaining one, while Ultimate runs out of memory on it. Of the 281 benchmarks in the Ultimate suite, omegaVPLinc correctly returns an answer on 253 (165 where inclusion holds and 88 where it does not), times out on 27 and runs out of memory on 1. Ultimate, however, only terminates on 142 benchmarks, running out of memory on the remaining 139 (the red data points on the top edge in Fig. 2). There are 7 benchmarks for which Ultimate terminates but omegaVPLinc does not (the data points on the right edge but not the top one), whereas there are 118 benchmarks for which omegaVPLinc terminates but Ultimate does not (the red data points on the top edge but not the right one). Of the 135 benchmarks on which both tools terminate, omegaVPLinc is faster than Ultimate on 123 (data points touching no edge and lying above the diagonal). Moreover, omegaVPLinc and Ultimate always coincide on whether inclusion holds (98) or not (37). This empirical evaluation suggests that omegaVPLinc scales better than Ultimate on both benchmark sets.

## **9 Conclusion and Future Work**

We presented novel algorithms to solve the inclusion problem between visibly pushdown languages of infinite words; they leverage antichain-like techniques as well as separate quasiorders for prefixes and periods of ultimately periodic words. Our empirical evaluation suggests that our approach scales better than those relying on an explicit complementation. Future work includes extending our approach to the class of operator-precedence languages [15], which also enjoys an ExpTime-complete inclusion problem, is strictly contained in the class of deterministic CFLs, and strictly contains the VPLs [8].

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## Stack-Aware Hyperproperties<sup>⋆</sup>

Ali Bajwa<sup>2</sup>, Minjian Zhang<sup>1</sup>, Rohit Chadha<sup>2</sup>, and Mahesh Viswanathan<sup>1</sup>

> <sup>1</sup> University of Illinois, Urbana-Champaign, USA {minjian2,vmahesh}@illinois.edu <sup>2</sup> University of Missouri, Columbia, USA {azb9q8,chadhar}@missouri.edu

Abstract. A hyperproperty relates executions of a program and is used to formalize security objectives such as confidentiality, non-interference, privacy, and anonymity. Formally, a hyperproperty is a collection of allowable sets of executions. A program violates a hyperproperty if the set of its executions is not in the collection specified by the hyperproperty. The logic HyperCTL\* has been proposed in the literature to formally specify and verify hyperproperties. The problem of checking whether a finite-state program satisfies a HyperCTL\* formula is known to be decidable. However, the problem turns out to be undecidable for procedural (recursive) programs. Surprisingly, we show that decidability can be restored if we consider restricted classes of hyperproperties, namely those that relate only those executions of a program which have the same call-stack access pattern. We call such hyperproperties stack-aware hyperproperties. Our decision procedure can be used as a proof method for establishing security objectives such as noninference for recursive programs, and also for refuting security objectives such as observational determinism. Further, if the call stack size is observable to the attacker, the decision procedure provides exact verification.

Keywords: Hyperproperties · Temporal Logic · Recursive Programs · Model Checking · Pushdown Systems · Visibly Pushdown Automata.

## 1 Introduction

Temporal logics HyperLTL and HyperCTL\* [5] were designed to express and reason about security guarantees that are hyperproperties [6]. A hyperproperty [6] is a security guarantee that does not depend solely on individual executions. Instead, a hyperproperty relates multiple executions. For example, non-interference, a confidentiality property, states that any two executions of a program that differ only in high-level security inputs must have the same low-security observations. As pointed out in [6], several security guarantees are hyperproperties. The logic HyperCTL\* subsumes HyperLTL, and the problem of checking a finite-state system against a HyperCTL\* formula is decidable [5].

© The Author(s) 2023

<sup>⋆</sup> Ali Bajwa was partially supported by NSF CNS 1553548. Rohit Chadha was partially supported by NSF CNS 1553548 and NSF SHF 1900924. Mahesh Viswanathan and Minjian Zhang were partially supported by NSF SHF 1901069 and NSF SHF 2007428.

S. Sankaranarayanan and N. Sharygina (Eds.): TACAS 2023, LNCS 13993, pp. 308–325, 2023. https://doi.org/10.1007/978-3-031-30823-9\_16

In this paper, we consider the problem of model checking procedural (recursive) programs against security hyperproperties. Recall that recursive programs are naturally modeled as pushdown systems. Unlike the case of finite-state transition systems, the problem of checking whether a pushdown system satisfies a HyperCTL\* formula is undecidable [16]. In contrast, CTL\* model checking is decidable for pushdown systems [3,18].

Our contributions. We consider restricted classes of hyperproperties for recursive programs, namely those that relate only those executions that have the same call-stack access pattern. Intuitively, two executions have the same stack access pattern if they access the call stack in the same manner at each step, i.e., if one execution pushes (pops) at a point, then the other execution pushes (pops) at the same point. Observe that if two executions have the same stack access pattern, then their stack sizes are the same at all times. We call such hyperproperties stack-aware hyperproperties.

In order to specify stack-aware hyperproperties, we extend HyperCTL\* to sHCTL\*. The logic sHCTL\* has a two-level syntax. At the first level, the syntax is identical to that of HyperCTL\* formulas, and is interpreted over executions of the pushdown system with the same stack access pattern. At the top level, we quantify over different stack access patterns. The formula Eψ is true if for some stack access pattern ρ of the system, the pushdown system restricted to executions with stack access pattern ρ satisfies the HyperCTL\* formula ψ. The formula Aψ is true if for each stack access pattern ρ of the system, the pushdown system restricted to executions with stack access pattern ρ satisfies the HyperCTL\* formula ψ. See Figure 1 on Page 8 for a side-by-side comparison of the syntax for HyperCTL\* and sHCTL\*. HyperLTL is extended to sHLTL similarly. Please note that sHCTL\* subsumes sHLTL, and that sHCTL\* (sHLTL) coincides with HyperCTL\* (HyperLTL) for finite-state systems, as all executions of finite-state systems have the same stack access pattern.

We show that the model checking problem for sHCTL\* is decidable. We demonstrate three different ways this result can aid in verifying recursive programs. First, for security guarantees such as noninference [14], which are expressible in the ∀∃* fragment of HyperLTL, we can use the model checking algorithm to establish that a recursive program satisfies the HyperLTL property. Secondly, for the ∀* fragment of HyperLTL, the model checking algorithm can be used to detect security flaws by establishing that a recursive program does not satisfy security guarantees. Observational determinism [13,19] is an example of such a property. Finally, when the attacker can observe stack access patterns (or, equivalently, stack sizes), we can get exact verification for several properties. The assumption of the attacker observing stack access patterns holds, for example, in the program counter security model [15] in which the attacker has access to the program counters at each step. As argued in [15], the program counter security model is appropriate to capture control-flow side channels such as those arising from timing behavior and/or disclosure of errors.

The decision procedure uses an automata-theoretic approach inspired by the model checking algorithm for finite-state systems and HyperCTL\* given in [10]. Since stack-aware hyperproperties relate only executions with the same stack access pattern, a set of executions with the same stack access pattern can be encoded as a word over a pushdown alphabet,<sup>3</sup> and the problem of model checking a sHCTL\* formula can be reduced to the problem of checking emptiness of a non-deterministic visibly pushdown automaton (NVPA) over infinite words [1]. The reduction of the model checking problem to the emptiness problem is based on a compositional construction of an automaton for each sub-formula which accepts exactly the set of assignments to path variables that satisfy the sub-formula. For this construction to be optimal, we carefully leverage two equi-expressive classes of automata on infinite words, namely NVPAs and 1-way alternating jump automata (1-AJA) [4]. The model checking algorithm for sHCTL\* against procedural programs has a complexity that is very close to the complexity of model checking finite-state systems against HyperCTL\*. If g(k, n) denotes a tower of exponentials of height k, where the topmost exponent is poly(n), then for a formula with formula complexity r,<sup>4</sup> and a system and formula whose size is bounded by n, our algorithm is in DTIME(g(⌈r/2⌉, n)). In comparison, model checking finite-state systems against HyperCTL\* is in NSPACE(g(⌈r/2⌉ − 1, n)). This slight difference in complexity is consistent with checking other properties like invariants for finite-state systems (NL) versus procedural programs (P).

We also prove that our model checking algorithm is optimal by proving a matching lower bound. Our proof of DTIME(g(⌈r/2⌉, n))-hardness of the model checking problem for formulas with (formula) complexity r relies on reducing the membership problem for g(⌈r/2⌉ − 1, n)-space-bounded alternating Turing machines (ATMs) to the model checking problem. The reduction requires identifying an encoding of computations of ATMs, which are trees, as strings that can be guessed and generated by pushdown systems. The pushdown system we construct for the model checking problem guesses potential computations of the ATM, while the sHCTL\* formula we construct checks whether the guessed computation is a valid accepting computation.

Related work. Clarkson and Schneider introduced hyperproperties [6] and demonstrated their need to capture complex security properties. Temporal logics HyperLTL and HyperCTL\*, which describe hyperproperties, were introduced by Clarkson et al. [5]. They also characterized the complexity of model checking finite-state transition systems against HyperCTL\* specifications by a reduction to the satisfiability problem of QPTL [17]. Subsequently, other model checking algorithms for verifying finite-state systems against HyperCTL\* properties have been proposed [10,7]. Tools that check satisfiability [8] and runtime verification [9] for HyperLTL formulas have also been developed. Finkbeiner et al. introduced the automata-theoretic approach to model checking HyperCTL\* for finite-state systems [10].

<sup>3</sup> A pushdown alphabet is an alphabet that is partitioned into three sets: a set of call symbols, a set of internal symbols, and a set of return symbols. See Section 4.1.

<sup>4</sup> Our definition of formula complexity is roughly double the usual notion of quantifier alternation. For a precise definition, see Definition 4.

The model checking problem for HyperLTL, and consequently HyperCTL\*, was shown to be undecidable for pushdown systems in [16]. For restricted fragments of HyperLTL, Pommellet and Touili [16] introduced over-approximations and under-approximations to establish/refute that a pushdown system satisfies a HyperLTL formula in those fragments. Gutsfeld et al. introduced stuttering Hµ, a linear-time logic for checking asynchronous hyperproperties of recursive programs, in [12]. The authors present complexity results for the model checking problem under an assumption of fairness and a restriction of well-alignment. While the restriction to paths with the same stack access pattern is similar to the well-alignment restriction, we do not assume any fairness condition to establish decidability. However, as sHCTL\* is a branching-time logic and only considers synchronous hyperproperties, the two logics are not directly comparable. It is also worth mentioning that the branching nature of sHCTL\* requires us to "copy" a potentially unbounded stack, from the most recently quantified path variable, when assigning a path to the "current" quantified path variable. In contrast, all path assignments in [12] start with an empty stack.

Due to lack of space, some proofs are omitted; they can be found in [2].

## 2 Motivation

Clarkson and Schneider [6] argue that many important security guarantees are expressible only as hyperproperties. We discuss two examples of security hyperproperties, and the logics HyperLTL and HyperCTL\* used to specify them.

Hyperproperties and temporal logics. We discuss two variants of non-interference [11] that model confidentiality requirements. In non-interference, the inputs of a system are partitioned into low-level input security variables and high-level input security variables. The attacker is assumed to know the values of low-level security inputs. During an execution, the attacker can observe parts of the system configuration such as system outputs, or the memory usage. A system satisfies non-interference if the attacker cannot deduce the values of high-level inputs from the low-level observations. To formally specify the variants, we use the logic HyperLTL [5], a fragment of the logic HyperCTL\* [5]. The precise syntax of HyperLTL and HyperCTL\* is shown in Fig. 1. In the syntax, π is a path variable and the formula a_π is true if the proposition a is true along the path π. Intuitively, the formula ∃π. ψ is existential quantification over paths, and is true if there is a path that can be assigned to π such that ψ is true. Universal quantification (∀π. ψ), and other logical connectives such as conjunction (∧), implication (→), equivalence (↔) and the temporal operators G and F can be defined in the standard way. By having explicit path variables, HyperLTL and HyperCTL\* allow quantification over multiple paths simultaneously.

Example 1. The first variant, noninference [14], states that for each execution σ of a program, there is another execution σ′ such that (a) σ′ is obtained from σ by replacing the high-level security inputs by a dummy input, and (b) σ and σ′ have the same low-level observations. Noninference is a hyperliveness property [5,6].

Let us assume that the low-level observations of a configuration are determined by the values of the propositions in L = {ℓ₁, ..., ℓ_m}. As shown in [5], noninference is expressible by the HyperLTL formula NI ≝ ∀π. ∃π′. (G λ_{π′}) ∧ (π ≡_L π′). Here G λ_{π′} expresses that Globally (i.e., in each configuration of the execution) the high input of π′ is the dummy input λ, and π ≡_L π′ ≝ G(⋀_{ℓ∈L}(ℓ_π ↔ ℓ_{π′})) expresses that π and π′ have the same low-level observations.

Example 2. The second variant, observational determinism [13,19], states that any two executions that have the same low-level initial inputs must have the same low-level output observations. Observational determinism is a hypersafety property [5,6], and is also expressible in HyperLTL using the formula [5] OD ≝ ∀π. ∀π′. (π[0] ≡_{L,in} π′[0]) → (π ≡_{L,out} π′). Here ≡_{L,in} and ≡_{L,out} express the fact that π and π′ have the same low-security inputs and outputs, respectively.

Procedural (recursive) programs and stack-aware hyperproperties. Pushdown systems model procedural programs that do not dynamically allocate memory, and whose program variables take values in finite domains. Unlike finite-state transition systems, the problem of checking whether a pushdown system satisfies a HyperCTL\* formula is undecidable [16]. However, we identify a natural class of hyperproperties for which the model checking problem becomes decidable. As we shall shortly see, this class of hyperproperties not only enjoys decidability, but is also useful in reasoning about security hyperproperties such as noninference and observational determinism.

We consider a restricted class of hyperproperties for recursive programs, which relate only executions that access the call stack in the same manner, i.e., push or pop at the same time. An execution of a pushdown system P is a sequence of configurations (control state + stack) σ = c₁c₂⋯ such that the stacks of consecutive configurations c_i and c_{i+1} differ only in the possible presence of an additional element at the top of the stack of either c_i or c_{i+1}. With such a sequence, we can associate a sequence pr(σ) = o₁o₂⋯ with o_i ∈ {call, int, ret} such that o_i = call (resp. ret) if and only if the stack in c_{i+1} has one more (resp. one less) element than c_i. The sequence pr(σ) is said to be the stack access pattern of σ. Observe that the stack sizes of two executions with the same stack access pattern evolve in the same fashion. Thus, equivalently, we can consider this class of hyperproperties to be the hyperproperties that relate executions with identical memory usage.
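The map pr(·) can be sketched from stack heights alone; the function name below is ours, chosen for illustration.

```python
# Hedged sketch: computing the stack access pattern pr(sigma) from the
# stack heights of consecutive configurations of an execution.

def stack_access_pattern(heights):
    """Map consecutive stack heights to 'call' (push), 'ret' (pop), or 'int'."""
    pattern = []
    for h, h_next in zip(heights, heights[1:]):
        if h_next == h + 1:
            pattern.append("call")
        elif h_next == h - 1:
            pattern.append("ret")
        else:
            pattern.append("int")
    return pattern

# A run whose stack heights evolve 0,1,2,2,1 pushes twice, idles, then pops.
assert stack_access_pattern([0, 1, 2, 2, 1]) == ["call", "call", "int", "ret"]
```

Two executions are related by a stack-aware hyperproperty exactly when this derived sequence is the same for both.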

To specify these hyperproperties, we propose the logic sHCTL\*, which extends HyperCTL\*. sHCTL\* has a two-level syntax. At the innermost level, the syntax is identical to that of HyperCTL\* formulas, but is interpreted over executions of the pushdown system with the same stack access pattern. At the outer level, we quantify over different stack access patterns. Intuitively, the formula Eψ is true if there is a stack access pattern ρ exhibited by the system such that the set of executions with access pattern ρ satisfies the hyperproperty ψ. The dual formula Aψ, defined as ¬E¬ψ, is true if for each stack access pattern ρ exhibited by the system, the set of all executions with stack access pattern ρ satisfies ψ. The syntax of sHLTL is obtained from HyperLTL in a similar fashion. Please see Fig. 1 on Page 8 for a side-by-side comparison of the syntax of HyperCTL\* (HyperLTL) and sHCTL\* (sHLTL). Unlike for HyperCTL\*, we show that the problem of checking sHCTL\* is decidable for pushdown systems (Theorem 3). Formal definitions of stack access patterns, and the syntax and semantics of sHCTL\*, are in Section 3.

For the rest of the paper, hyperproperties expressible in sHCTL\* will be called stack-aware hyperproperties. Restricting to stack-aware hyperproperties is useful in verifying security guarantees of recursive programs as discussed below.

Proving ∀∃* hyperproperties. The noninference property (Example 1) can be expressed in HyperLTL as NI ≝ ∀π. ∃π′. (G λ_{π′}) ∧ (π ≡_L π′). Consider the sHLTL formula A(NI) obtained by putting an A in front of NI. A pushdown system satisfies A(NI) only if for each execution σ of the system, there is another execution σ′ with the same stack access pattern as σ such that σ, σ′ together satisfy (G λ_{σ′}) ∧ (σ ≡_L σ′). Thus, if the pushdown system satisfies the sHLTL formula A(NI), then it also satisfies noninference. Hence, a decision procedure for sHLTL can be used to prove that a recursive program satisfies noninference.

The above observation generalizes to HyperLTL formulas of the form ψ = ∀π. ∃π₁. ⋯ ∃π_k. ψ′: if a system satisfies the sHLTL formula Aψ then it must also satisfy the HyperLTL formula ψ. Though the model checking problem is undecidable for pushdown systems even when restricted to such HyperLTL formulas, we gain decidability by restricting the search space for π, π₁, ..., π_k.

Refuting ∀* hyperproperties. Observational determinism (Example 2) can be written in HyperLTL as OD ≝ ∀π. ∀π′. (π[0] ≡_{L,in} π′[0]) → (π ≡_{L,out} π′). Consider the sHLTL formula A(OD). A pushdown system fails to satisfy the sHLTL formula A(OD) only if there is a stack access pattern ρ and executions σ₁ and σ₂ with stack access pattern ρ such that σ₁ and σ₂ do not satisfy (σ₁[0] ≡_{L,in} σ₂[0]) → (σ₁ ≡_{L,out} σ₂).

This observation generalizes to HyperLTL formulas of the form ψ = ∀π₁. ⋯ ∀π_k. ψ′: if a pushdown system fails to satisfy the sHLTL formula Aψ then it does not satisfy ψ. Even though model checking pushdown systems against such restricted specifications is undecidable, our decision procedure can be used to show that a recursive program does not meet such properties.

Exact verification when the stack access pattern is observable. Often, it is reasonable to assume that the attacker can observe the stack access pattern. For example, in the program counter security model [15], the attacker has access to the program counter transcript, i.e., the sequence of program counters during an execution. Access to the program counter transcript implies that the attacker can observe the stack access pattern. The assumption that the program counter transcript is observable helps model control-flow side channel attacks, which include timing attacks and error disclosure attacks [15]. sHCTL\* can be used to verify security guarantees in this security model. For example, the sHCTL\* formula A(NI) models noninference faithfully once a unique proposition is introduced for each control state. Observational determinism can also be verified in this model by suitably transforming the pushdown automaton.

Another scenario in which stack access patterns are observable is when the attacker can observe the memory usage of a program in terms of stack size. As observing the stack size may lead to information leakage, the stack size should be considered a low-level observation. Since the stack size can be unbounded, it cannot be modeled as a proposition. sHCTL\*, however, can still be used to verify security guarantees in this case. For example, A(NI) = A(∀π. ∃π′. (G λ_{π′}) ∧ (π ≡_L π′)) faithfully models noninference since the semantics of sHCTL\* forces π and π′ to have the same call-stack size in addition to the same low-level observations. Once again, observational determinism can also be verified in this model by suitably transforming the pushdown automaton.

## 3 Stack-aware Hyper Computation Tree Logic (sHCTL\*)

Stack-aware Hyper Computation Tree Logic (sHCTL\*) and its sub-logic Stack-aware Hyper Linear Temporal Logic (sHLTL) are formally presented in this section. We begin by establishing some conventions for strings.

Strings. A string/word w over a finite alphabet Σ is a sequence w = a₀a₁⋯ of finitely or infinitely many symbols from Σ, i.e., a_i ∈ Σ for all i. The length of a string w, denoted |w|, is the number of symbols appearing in it: if w = a₀a₁⋯a_{n−1} is finite then |w| = n, and if w = a₀a₁⋯ is infinite then |w| = ω. The unique string of length 0, the empty string, is denoted ε. For a string w = a₀a₁⋯a_i⋯, w(i) = a_i denotes the i-th symbol, w[ : i] = a₀a₁⋯a_{i−1} denotes the prefix of length i, w[i : ] = a_i a_{i+1}⋯ denotes the suffix of w starting at position i, and w[i : j] = a_i a_{i+1}⋯a_{j−1} denotes the substring from position i (included) to position j (excluded). Thus w[0 : ] = w. By convention, when i ≤ 0, we take w[ : i] = ε. Over Σ, the set of all finite strings is denoted Σ*, and the set of all infinite strings is denoted Σ^ω. For a finite string u and a (finite or infinite) string v, uv denotes the concatenation of u and v.
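Incidentally, these conventions coincide with Python's slice semantics on (finite) strings; the following sketch, with an example string of our own choosing, makes the correspondence concrete.

```python
# The paper's w[:i], w[i:], w[i:j] conventions match Python string slices.
w = "abcde"
assert w[:3] == "abc"    # prefix of length 3: a0 a1 a2
assert w[3:] == "de"     # suffix starting at position 3
assert w[1:4] == "bcd"   # substring from position 1 (included) to 4 (excluded)
assert w[:0] == ""       # the empty prefix, i.e. the empty string
assert w[0:] == w        # w[0:] = w
```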

### 3.1 Pushdown Systems

Pushdown systems naturally model sequential recursive programs. Formally, an AP-labeled pushdown system is a tuple P = (S, Γ, s_in, ∆, L), where S is a finite set of control states, Γ is a finite set of stack symbols, s_in ∈ S is the initial control state, L : S → 2^AP is the labeling function, and ∆ is the transition relation. The transition relation ∆ = ∆_int ⊎ ∆_call ⊎ ∆_ret is the disjoint union of internal transitions ∆_int ⊆ S × S, where the stack is unchanged, call transitions ∆_call ⊆ S × (S × Γ), where a single symbol is pushed onto the stack, and return transitions ∆_ret ⊆ (S × Γ) × S, where a single symbol is popped from the stack. When AP is clear from the context, we simply refer to them as pushdown systems.

Transition System Semantics. We recall the standard semantics of a pushdown system as a transition system. Let us fix a pushdown system P = (S, Γ, s_in, ∆, L). A configuration c of P is a pair (s, α) where s ∈ S and α ∈ Γ*.
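A minimal sketch of the tuple P = (S, Γ, s_in, ∆, L) as a data structure; the field names are our own, the paper fixes only the mathematical definition.

```python
# Hedged sketch of an AP-labeled pushdown system; names mirror the tuple
# P = (S, Gamma, s_in, Delta, L) but are illustrative, not a standard API.
from dataclasses import dataclass, field

@dataclass
class PushdownSystem:
    states: set              # S, finite set of control states
    stack_symbols: set       # Gamma
    initial: object          # s_in
    delta_int: set           # pairs (s, s'): stack unchanged
    delta_call: set          # pairs (s, (s', a)): push a
    delta_ret: set           # pairs ((s, a), s'): pop a
    labels: dict = field(default_factory=dict)   # L : S -> 2^AP

# A toy system: state 0 "calls" (pushing 'a') into state 1, which returns.
P = PushdownSystem({0, 1}, {"a"}, 0,
                   delta_int=set(),
                   delta_call={(0, (1, "a"))},
                   delta_ret={((1, "a"), 0)},
                   labels={0: {"low"}, 1: {"high"}})
assert (0, (1, "a")) in P.delta_call
```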

a ∈ AP, π ∈ V

(a) HyperCTL\*: ψ ::= a_π | ¬ψ | ψ ∨ ψ | X ψ | ψ U ψ | ∃π. ψ

(b) sHCTL\*: θ ::= Eψ | ¬θ | θ ∨ θ, where ψ ::= a_π | ¬ψ | ψ ∨ ψ | X ψ | ψ U ψ | ∃π. ψ

Fig. 1: BNF for HyperCTL\* and sHCTL\*. Let ∀π. ψ denote ¬∃π. ¬ψ and Aψ denote ¬E¬ψ. HyperLTL is the set of HyperCTL\* formulas Q₁π₁. ⋯ Q_rπ_r. ψ where Q_i ∈ {∃, ∀} and ψ is quantifier-free. sHLTL is the set of sHCTL\* formulas qφ, where q ∈ {A, E} and φ is in HyperLTL.

The set of all configurations of P will be denoted Conf_P = S × Γ*. The labeled transition system associated with P is ⟦P⟧ := (Conf_P, c_in, −→, AP, L), where c_in = (s_in, ε) is the initial configuration, −→ ⊆ Conf_P × ({call, ret, int} × S × (Γ ∪ {ε}) × S) × Conf_P is the transition relation, and L is the labeling function that extends the labeling function of P to configurations as follows: L(s, α) = L(s). The transition relation −→ is defined to capture the informal semantics of internal, call, and return transitions. For any α ∈ Γ*:

**–** (int) (s, α) −→^{(int, s, ε, s′)} (s′, α) if (s, s′) ∈ ∆_int;

**–** (call) (s, α) −→^{(call, s, a, s′)} (s′, aα) if (s, (s′, a)) ∈ ∆_call;

**–** (ret) (s, aα) −→^{(ret, s, a, s′)} (s′, α) if ((s, a), s′) ∈ ∆_ret.
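The three rules can be sketched as a successor function on configurations; the function name and the stack-as-string encoding (top of stack at the left) are our assumptions.

```python
# Hedged sketch of the (int)/(call)/(ret) rules on configurations (s, alpha),
# where alpha is the stack encoded as a string with its top at the left.

def successors(conf, delta_int, delta_call, delta_ret):
    """All one-step successor configurations of conf = (s, alpha)."""
    s, alpha = conf
    succs = []
    for (p, q) in delta_int:            # (int): stack unchanged
        if p == s:
            succs.append((q, alpha))
    for (p, (q, a)) in delta_call:      # (call): push a onto the stack
        if p == s:
            succs.append((q, a + alpha))
    for ((p, a), q) in delta_ret:       # (ret): pop a from the stack top
        if p == s and alpha.startswith(a):
            succs.append((q, alpha[len(a):]))
    return succs

# From (0, "") a call pushes 'a'; from (1, "a") a return pops it again.
assert successors((0, ""), set(), {(0, (1, "a"))}, set()) == [(1, "a")]
assert successors((1, "a"), set(), set(), {((1, "a"), 0)}) == [(0, "")]
```

Iterating `successors` from c_in = (s_in, ε) enumerates the reachable part of ⟦P⟧.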

A path of ⟦P⟧ is an infinite sequence of configurations σ = c₀, c₁, ... such that for each i, c_i −→^{(o, s, a, s′)} c_{i+1} for some o ∈ {int, call, ret}, s, s′ ∈ S and a ∈ Γ ∪ {ε}. The path σ is said to start in configuration c₀ (the first configuration in the sequence). We will use Paths(⟦P⟧, c) to denote the set of paths of ⟦P⟧ starting in the configuration c, and Paths(⟦P⟧) to denote all paths of ⟦P⟧.

We conclude this section by introducing some notation on configurations. For c = (s, α), its stack height is |α|, its control state is state(c) = s, and its top-of-stack symbol is top(c) = a ∈ Γ if α = aα′, and is undefined if α = ε.

### 3.2 Syntax of sHCTL\*

Let us fix a set of atomic propositions AP and a set of path variables V. The BNF grammar for sHCTL\* formulas is given in Figure 1(b). In the BNF grammar, a ∈ AP is an atomic proposition, π is a path variable, ψ is a cognate formula, and θ is a sHCTL\* formula. The syntax has two levels, with the inner level identical to HyperCTL\* formulas, while the outer level allows quantification over different stack access patterns (see Section 3.3). Also, following [5,10], we assume that the until operator U only occurs within the scope of a path quantifier.
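The two-level grammar can be mirrored as a small AST sketch; the class names below are ours, chosen for illustration only (the paper defines just the BNF).

```python
# Hedged sketch: the two-level sHCTL* grammar as Python AST node classes.
from dataclasses import dataclass

# Inner level: cognate (HyperCTL*-style) formulas psi.
@dataclass
class Atom:   prop: str; path: str        # a_pi
@dataclass
class Not:    sub: object                 # ¬psi
@dataclass
class Or:     left: object; right: object # psi ∨ psi
@dataclass
class Next:   sub: object                 # X psi
@dataclass
class Until:  left: object; right: object # psi U psi
@dataclass
class Exists: path: str; sub: object      # ∃pi. psi

# Outer level: sHCTL* formulas theta; E is the only primitive pattern
# quantifier, with A definable as ¬E¬.
@dataclass
class E:        sub: object               # E psi
@dataclass
class NotTheta: sub: object               # ¬theta
@dataclass
class OrTheta:  left: object; right: object

phi = E(Exists("pi", Atom("a", "pi")))    # E ∃pi. a_pi
assert isinstance(phi, E) and phi.sub.path == "pi"
```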

Remark 1. We have chosen not to have A, the dual of E, and conjunction as explicit logical operators, to keep our exposition simple. This choice does make the automata constructions presented here less efficient for formulas involving conjunction. Adding them explicitly does not pose a technical challenge to our setup, and our automata constructions can be extended to handle them. In addition, we will sometimes use other quantifiers and logical operators to write formulas. Some standard examples include: θ₁ ∧ θ₂ = ¬(¬θ₁ ∨ ¬θ₂), where θᵢ (i ∈ {1, 2}) is either a sHCTL\* or a cognate formula; ∀π. ψ = ¬∃π. ¬ψ; F ψ = true U ψ, where true = a_π ∨ ¬a_π; and G ψ = ¬F ¬ψ.

We call formulas of the form qψ (where q ∈ {A, E} and ψ is a cognate formula) basic formulas. Observe that any sHCTL\* formula is a Boolean combination of basic formulas. A sHCTL\* formula θ is a sentence if in each basic sub-formula qψ, ψ is a sentence, i.e., every path variable appearing in ψ is quantified. Without loss of generality, we assume that in any cognate formula ψ, all bound variables are renamed to ensure that any path variable is quantified at most once. We will only consider sHCTL\* sentences in this paper. The logic sHLTL is the sub-logic of sHCTL\* consisting of all formulas of the form q Q₁π₁. ··· Qᵣπᵣ. ψ, where q ∈ {A, E}, Qᵢ ∈ {∃, ∀}, and ψ is quantifier free.

## 3.3 Semantics of sHCTL\*

The syntax of cognate formulas is identical to that of HyperCTL\* formulas. Their semantics will be described in a similar manner, in a context where free path variables in the formula are interpreted as executions of a system. However, we will require that the interpretations of all path variables share a common stack access pattern; hence the term cognate. Thus, before defining the semantics, we will define what we mean by the stack access pattern of a path, and a path environment that assigns an interpretation to path variables.

For the rest of this section let us fix a pushdown system P = (S, Γ, s_in, Δ, L). A string w ∈ {call, int, ret}* is said to be well matched if either w = ε, or w = int, or w = call u ret, or w = uv, where u, v ∈ {call, int, ret}* are (recursively) well matched. In a string ρ ∈ {call, int, ret}^ω, ρ(i) is an unmatched return if ρ[ : i + 1] = w ret, where w is well matched. We are now ready to present the definition of a stack access pattern.
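The recursive definition of well-matched strings is equivalent to the usual counter characterization: pending calls never go negative, and all calls are matched at the end. A short Python sketch (all helper names are ours) of this check, together with the unmatched-return positions used in the definition below:

```python
# Sketch: a string over {"call","int","ret"} is well matched iff, scanning
# left to right, the count of pending calls never goes negative and is zero
# at the end. This matches the recursive definition in the text.
def well_matched(w):
    depth = 0
    for o in w:
        if o == "call":
            depth += 1
        elif o == "ret":
            if depth == 0:
                return False  # a return with no pending call
            depth -= 1
    return depth == 0

def unmatched_return_positions(prefix):
    """Positions i such that prefix[:i+1] = w ret with w well matched."""
    return [i for i, o in enumerate(prefix)
            if o == "ret" and well_matched(prefix[:i])]
```

For instance, in call ret ret int only the second ret (position 2) is an unmatched return, since the prefix before it, call ret, is well matched.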

Definition 1 (Stack access pattern). A string ρ ∈ {call, int, ret}^ω is a stack access pattern if the set {i ∈ N | ρ(i) is an unmatched return} is finite.

A path σ = c₀c₁c₂ ··· ∈ Paths(⟦P⟧) is said to have stack access pattern ρ = o₀o₁ ··· (denoted pr(σ) = ρ) if for every i: (a) oᵢ = call if and only if stack(cᵢ₊₁) = top(cᵢ₊₁) stack(cᵢ); (b) oᵢ = int if and only if stack(cᵢ₊₁) = stack(cᵢ); and (c) oᵢ = ret if and only if stack(cᵢ) = top(cᵢ) stack(cᵢ₊₁).
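Conditions (a)–(c) can be read off the stacks of successive configurations. A minimal Python sketch (our own names; stacks are tuples with the top first) recovers the stack access pattern of a finite path prefix:

```python
# Sketch: recover the stack access pattern of a finite path prefix from the
# stacks of its successive configurations (tuples, top of stack first).
def stack_access_pattern(stacks):
    pattern = []
    for cur, nxt in zip(stacks, stacks[1:]):
        if len(nxt) == len(cur) + 1 and nxt[1:] == cur:
            pattern.append("call")   # (a): one symbol was pushed
        elif len(cur) == len(nxt) + 1 and cur[1:] == nxt:
            pattern.append("ret")    # (c): the top symbol was popped
        else:
            pattern.append("int")    # (b): the stack is unchanged
    return pattern
```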

We now present the definition of a path environment, which interprets the free path variables in a cognate formula as paths of ⟦P⟧ that share a common stack access pattern. This plays a key role in defining the semantics of sHCTL\*. For a set of path variables V, let V† be defined as the set V ∪· {†}.

Definition 2 (Path Environment). A path environment for pushdown system P over variables V is a function Π : V† → Paths(⟦P⟧) ∪ {call, int, ret}^ω such that Π(†) is a stack access pattern, and for every π ∈ V, Π(π) ∈ Paths(⟦P⟧) with pr(Π(π)) = Π(†). When the pushdown system is clear from the context, we will simply refer to it as a path environment over V.

When V = ∅, we additionally require that there is a path σ ∈ Paths(⟦P⟧, c_in) (where c_in is the initial configuration of ⟦P⟧) such that pr(σ) = Π(†).

We introduce some notation related to path environments. Let us fix a path environment Π over variables V. Given a path σ ∈ Paths(⟦P⟧), Π[π ↦ σ] denotes the path environment over V ∪ {π} such that Π[π ↦ σ](π) = σ, and Π[π ↦ σ](π′) = Π(π′) for any π′ ∈ V† with π′ ≠ π. Finally, for i ∈ N, Π[i : ] denotes the suffix path environment, where every variable is mapped to the suffix of the path starting at position i. More formally, for every π′ ∈ V†, Π[i : ](π′) = Π(π′)[i : ].

We now define when a pushdown system P satisfies a sHCTL\* sentence θ, denoted P ⊨ θ. The definition of satisfaction of θ relies on a definition of satisfaction for cognate formulas. To define the semantics of cognate formulas inductively, we will interpret free path variables using a path environment. As in HyperCTL\*, it is important to track the most recently quantified path variable because that influences the semantics of ∃π(·). Thus satisfaction of cognate formulas takes the form P, Π, π′ ⊨ ψ, where π′ is the most recently quantified path variable and Π is a path environment over the free variables of ψ. Finally, by convention, we will take Paths(⟦P⟧, Π(†)(0)) to mean Paths(⟦P⟧, c_in), where c_in is the initial configuration of ⟦P⟧⁵. Below, θ, θ₁, and θ₂ are sHCTL\* sentences, while ψ, ψ₁, ψ₂ are cognate formulas.

$$\begin{array}{lcl}
P \models \neg\theta & \text{if} & P \not\models \theta\\
P \models \theta_1 \vee \theta_2 & \text{if} & P \models \theta_1 \text{ or } P \models \theta_2\\
P \models \mathsf{E}\psi & \text{if} & \text{for some path environment } \Pi \text{ over } \emptyset,\ P, \Pi, \dagger \models \psi\\
P, \Pi, \pi' \models a_\pi & \text{if} & a \in L(\Pi(\pi)(0))\\
P, \Pi, \pi' \models \neg\psi & \text{if} & P, \Pi, \pi' \not\models \psi\\
P, \Pi, \pi' \models \psi_1 \vee \psi_2 & \text{if} & P, \Pi, \pi' \models \psi_1 \text{ or } P, \Pi, \pi' \models \psi_2\\
P, \Pi, \pi' \models \mathsf{X}\psi & \text{if} & P, \Pi[1:], \pi' \models \psi\\
P, \Pi, \pi' \models \psi_1\,\mathsf{U}\,\psi_2 & \text{if} & \exists i \geq 0:\ P, \Pi[i:], \pi' \models \psi_2 \text{ and } \forall j,\ 0 \leq j < i,\ P, \Pi[j:], \pi' \models \psi_1\\
P, \Pi, \pi' \models \exists\pi.\,\psi & \text{if} & \exists\sigma \in \mathsf{Paths}([\![P]\!], \Pi(\pi')(0)) \text{ with } \mathsf{pr}(\sigma) = \Pi(\dagger),\\
& & \text{such that } P, \Pi[\pi \mapsto \sigma], \pi \models \psi
\end{array}$$

## 4 A Decision Procedure for sHCTL\*

Given a pushdown system P and a sHCTL\* sentence θ, we present an algorithm that determines if P ⊨ θ. Our approach is similar to the one in [10]. Given a finite state transition system K and a HyperCTL\* formula φ, Finkbeiner et al. [10] construct an alternating (finite state) Büchi automaton A_{K,φ}, by induction on φ, such that an input word σ is accepted by A_{K,φ} if and only if σ is the encoding

⁵ The convention is needed because Π(†)(0) is not a configuration but an element of the set {call, int, ret}.

of a path environment Π such that K, Π ⊨ φ. Determining if K ⊨ φ then reduces to checking if A_{K,φ} accepts any string.

Extending these ideas to sHCTL\* and pushdown systems requires one to answer two questions: (a) What is an encoding of path environments for cognate formulas, where path variables are mapped to sequences of configurations (control state + stack)? (b) Which automata models can capture the collection of path environments satisfying a cognate formula with respect to a pushdown system? We encode path environments for cognate formulas using strings over a pushdown alphabet; the pushdown tags on symbols add structure that helps encode sequences of configurations. For automata, we consider automata that process such strings and accept visibly pushdown languages. A natural generalization of the approach outlined in [10] would suggest the use of alternating visibly pushdown automata (AVPA) on infinite strings [4]. However, using AVPAs results in an inefficient algorithm. To get a more efficient algorithm, we instead rely on a careful use of nondeterministic visibly pushdown automata (NVPA) [1] and 1-way alternating jump automata (1-AJA) [4]. The advantage of using NVPA and 1-AJA can be seen in the case of existential quantification (∃π.), which requires converting an alternating automaton to a nondeterministic one [10]: converting a 1-AJA to a NVPA leads to an exponential blowup, while converting an AVPA to a NVPA leads to a doubly exponential blowup [4].

The rest of this section is organized as follows. We begin by introducing the automata models on pushdown alphabets (Section 4.1). Next we present our encoding of path environments, and finally our automata constructions that establish the decidability result (Section 4.2).

### 4.1 Automata on Pushdown Alphabets

A pushdown alphabet is a finite set Σ that is partitioned into three sets Σ_call ∪· Σ_int ∪· Σ_ret, where Σ_call is the set of call symbols, Σ_int is the set of internal symbols, and Σ_ret is the set of return symbols. Automata models processing strings over a pushdown alphabet are restricted to perform certain types of transitions based on whether the symbol read is a call, internal, or return symbol. We introduce, informally, two such automata models next. Precise definitions and semantics can be found in the detailed version of this paper [2].

Nondeterministic Visibly Pushdown Büchi Automata. A nondeterministic visibly pushdown automaton (NVPA) [1] is like a pushdown system. It has finitely many control states and uses an unbounded stack for storage. However, unlike a pushdown system, it is an automaton that processes an infinite sequence of input symbols from a pushdown alphabet Σ = Σ_call ∪· Σ_int ∪· Σ_ret. Transitions are constrained to conform to the pushdown alphabet: whenever a Σ_call symbol is read, a symbol is pushed onto the stack; whenever a Σ_ret symbol is read, the top stack symbol is popped; and whenever a Σ_int symbol is read, the stack is left unchanged.

1-way Alternating Jump Automata. Our second automaton model is 1-way Alternating Parity Jump Automata (1-AJA) [4]. 1-AJAs are computationally equivalent to NVPAs (i.e., they accept the same class of languages) but provide greater flexibility in describing algorithms. 1-AJAs are alternating automata, which means that they can define acceptance based on multiple runs of the machine on an input word. Though they are finite state machines with no auxiliary storage, their ability to spawn a computation thread that jumps to a future portion of the input string on reading a symbol allows them to have the same computational power as a more conventional machine with storage (like NVPAs).

We present some useful properties of NVPAs and 1-AJAs. The two models are equi-expressive, with known bounds on the size of the automata constructed by the translations.

Theorem 1 ([4]). For any NVPA N of size n, there is a 1-AJA A_N of size O(n²) such that L(A_N) = L(N). Conversely, for any 1-AJA A of size n, there is a NVPA N_A of size 2^O(n) such that L(N_A) = L(A). The constructions can be carried out in time proportional to the size of the resulting automaton.

Both 1-AJAs and NVPAs are closed under language operations like complementation, union, and prefixing. We also recall the following result.

Theorem 2 ([1]). For NVPAs, the emptiness problem is PTIME-complete.

## 4.2 Algorithm for sHCTL\*

Let us fix a pushdown system P = (S, Γ, s_in, Δ, L) and a sHCTL\* sentence θ. Our goal is to decide if P ⊨ θ. We will reduce this problem to checking the emptiness of multiple NVPAs (Theorem 2). Our approach is similar to [10]: for each cognate sub-formula ψ (not necessarily a sentence) of θ, we will compositionally construct an automaton that accepts the path environments satisfying ψ. Path environments will be encoded by strings over pushdown alphabets as follows.

For a path σ = c₀c₁c₂ ··· of ⟦P⟧, the trace of σ, denoted tr(σ), is the (unique) sequence (o₀, q₀, a₀, q₁)(o₁, q₁, a₁, q₂) ··· such that for every i ∈ N, cᵢ →(oᵢ, qᵢ, aᵢ, qᵢ₊₁) cᵢ₊₁, where oᵢ ∈ {call, int, ret}, qᵢ, qᵢ₊₁ ∈ S, and aᵢ ∈ Γ ∪ {ε}⁶.

While tr(σ) is uniquely determined by the path σ, the converse is not true; different paths may have the same trace. To see this, consider the following example. For a configuration c and γ ∈ Γ*, let γ(c) denote the configuration (state(c), stack(c)γ), i.e., the configuration with the same control state, but with a stack containing the symbols of γ at the bottom. Observe that, for any γ ∈ Γ*, if σ = c₀c₁c₂ ··· is a path then so is γ(σ) = γ(c₀)γ(c₁)γ(c₂) ···. Additionally, tr(σ) = tr(γ(σ)). Two paths σ₁ and σ₂ of ⟦P⟧ will be said to be equivalent if tr(σ₁) = tr(σ₂), denoted σ₁ ≃ σ₂. Observe that equivalent paths have the same stack access pattern, i.e., if σ₁ ≃ σ₂ then pr(σ₁) = pr(σ₂). The semantics of sHCTL\* does not distinguish between equivalent paths.

⁶ Observe that even when σ is not a path in ⟦P⟧ (i.e., does not correspond to an actual sequence of transitions of P), the trace of σ is uniquely defined as long as the stacks of successive configurations of σ can be obtained by leaving the stack unchanged, or by pushing/popping one symbol.

Proposition 1. Let φ be a cognate formula with V as the set of free path variables. Let Π₁ and Π₂ be two path environments such that for every π ∈ V, Π₁(π) ≃ Π₂(π). Then, P, Π₁, π ⊨ φ if and only if P, Π₂, π ⊨ φ.

The proof of Proposition 1 follows by induction on cognate formulas. Proposition 1 establishes that the set of path environments satisfying a cognate formula is a union of equivalence classes with respect to path equivalence. Thus, instead of constructing automata that accept path environments, we will construct automata that accept mappings from path variables to traces of paths. For m ∈ N, let Σ[m] = Σ[m]_call ∪· Σ[m]_int ∪· Σ[m]_ret be the pushdown alphabet where Σ[m]_call = {call} × Sᵐ × Γᵐ, Σ[m]_int = {int} × Sᵐ × {ε}ᵐ, and Σ[m]_ret = {ret} × Sᵐ × Γᵐ. Observe that Σ[0] is (essentially) the set {call, int, ret}.

Definition 3 (Encoding Path Environments). Consider a set of m path variables V = {π₁, π₂, ..., π_m}. A string w ∈ Σ[m]^ω where, for any j ∈ N, w(j) = (oⱼ, (s₁ʲ, s₂ʲ, ..., s_mʲ), (a₁ʲ, a₂ʲ, ..., a_mʲ)) encodes all path environments Π such that

$$\begin{array}{l} \Pi(\dagger) = o\_0\, o\_1\, o\_2 \cdots o\_j \cdots \\ \mathsf{tr}(\Pi(\pi\_i)) = (o\_0, s\_i^0, a\_i^0, s\_i^1)(o\_1, s\_i^1, a\_i^1, s\_i^2)\cdots \end{array}$$

for every i ∈ {1, 2, ..., m}. The string encoding a path environment Π is denoted enc(Π) (= w, in this case).

Based on these definitions, the following observation about traces and encodings can be made.

Proposition 2. For any path σ ∈ Paths(⟦P⟧) and i ∈ N, tr(σ[i : ]) = tr(σ)[i : ]. For any path environment Π and i ∈ N, enc(Π[i : ]) = enc(Π)[i : ].

The encoding of path environments as strings over Σ[m] (for an appropriate value of m) is used in our decision procedure, which compositionally constructs automata that accept the path environments satisfying each cognate formula. The size of our constructed automata, as in [10], will be a tower of exponentials whose height depends on the formula complexity of the cognate formula φ.

Definition 4 (Formula Complexity). The formula complexity of a sHCTL\* formula φ, denoted fc(φ), is inductively defined as follows. Let odd : N → N be the function that maps a number n to the smallest odd number ≥ n, i.e., odd(n) = n if n is odd and odd(n) = n + 1 if n is even. Similarly, even : N → N maps n to the smallest even number ≥ n, i.e., even(n) = odd(n + 1) − 1. Below, ψ₁, ψ₂ denote cognate formulas, and θ₁, θ₂ denote sHCTL\* sentences.

$$\begin{array}{ll} \mathsf{fc}(a\_{\pi}) = 0 & \mathsf{fc}(\neg\psi\_{1}) = \mathsf{even}(\mathsf{fc}(\psi\_{1}))\\ \mathsf{fc}(\mathsf{X}\psi\_{1}) = \mathsf{fc}(\psi\_{1}) & \mathsf{fc}(\psi\_{1}\vee\psi\_{2}) = \max(\mathsf{fc}(\psi\_{1}), \mathsf{fc}(\psi\_{2}))\\ \mathsf{fc}(\psi\_{1}\,\mathsf{U}\,\psi\_{2}) = \mathsf{even}(\max(\mathsf{fc}(\psi\_{1}), \mathsf{fc}(\psi\_{2}))) & \mathsf{fc}(\exists\pi.\,\psi\_{1}) = \mathsf{odd}(\mathsf{fc}(\psi\_{1}))\\ \mathsf{fc}(\mathsf{E}\psi\_{1}) = \mathsf{odd}(\mathsf{fc}(\psi\_{1})) & \mathsf{fc}(\neg\theta\_{1}) = \mathsf{fc}(\theta\_{1})\\ \mathsf{fc}(\theta\_{1}\vee\theta\_{2}) = \max(\mathsf{fc}(\theta\_{1}), \mathsf{fc}(\theta\_{2})) \end{array}$$

Observe the difference in the definitions of fc(¬θ₁) and fc(¬ψ₁): for ¬θ₁ there is no change in formula complexity, while for ¬ψ₁ we move to the next even level.
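Definition 4 can be transcribed directly. The following Python sketch (our own encoding: formulas as nested tuples, with separate tags for negation of cognate formulas and negation of sentences) computes fc:

```python
# Sketch of Definition 4; the tuple encoding is ours. "not_c" is negation of a
# cognate formula, "not_s" negation of a sentence, since the two differ in fc.
def odd_up(n):  return n if n % 2 == 1 else n + 1   # smallest odd  >= n
def even_up(n): return n if n % 2 == 0 else n + 1   # smallest even >= n

def fc(phi):
    op = phi[0]
    if op == "ap":     return 0                           # fc(a_pi) = 0
    if op == "not_c":  return even_up(fc(phi[1]))         # cognate negation
    if op == "X":      return fc(phi[1])
    if op == "or":     return max(fc(phi[1]), fc(phi[2]))
    if op == "U":      return even_up(max(fc(phi[1]), fc(phi[2])))
    if op == "exists": return odd_up(fc(phi[2]))          # ("exists", pi, psi)
    if op == "E":      return odd_up(fc(phi[1]))
    if op == "not_s":  return fc(phi[1])                  # sentence negation
    raise ValueError("unknown operator: " + op)
```

For example, E ∃π. (a_π U b_π) has formula complexity odd(odd(even(0))) = 1, while negating the inner quantified formula bumps the complexity to the next even level.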

Our main technical lemma is a compositional construction of an automaton for cognate formulas ψ. Depending on the parity of fc(ψ), the automaton we construct will either be a 1-AJA or a NVPA. Before presenting this lemma, we define a function that is a tower of exponentials. For c, k, n ∈ N, the value g_c(k, n) is defined inductively on k as follows: g_c(0, n) = cn log n, and g_c(k + 1, n) = 2^{g_c(k, n)}. We use g_{O(1)}(k, n) to denote the family of functions {g_c(k, n) | c ∈ N}.
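For intuition, the tower function can be written out in a couple of lines of Python (base-2 logarithm is our assumption; the paper leaves the base of the logarithm implicit):

```python
# Sketch of g_c(k, n): g_c(0, n) = c*n*log n, g_c(k+1, n) = 2^(g_c(k, n)).
import math

def g(c, k, n):
    if k == 0:
        return c * n * math.log2(n)  # assumes log base 2
    return 2 ** g(c, k - 1, n)
```

So g_1(0, 2) = 2 and g_1(1, 2) = 2² = 4; each increment of k adds one level to the tower.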

Lemma 1. Consider a pushdown system P = (S, Γ, s_in, Δ, L) and a sHCTL\* sentence θ. Let ψ be a cognate subformula of θ with free path variables in the set V = {π₁, ..., π_m} for m ∈ N. We assume, without loss of generality, that the variables π₁, ..., π_m are listed in the order in which they are quantified in θ, with π_m being the first free variable of ψ that is quantified in the context θ. In addition, we assume that the size of both ψ and P is bounded by n. There is an automaton A_ψ over the pushdown alphabet Σ[m] such that for any path environment Π over V,

> P, Π, π_m ⊨ ψ if and only if enc(Π) ∈ L(A_ψ).⁷

The automaton A_ψ is a NVPA if fc(ψ) is odd, and a 1-AJA if fc(ψ) is even. The size of A_ψ is at most g_{O(1)}(⌈fc(ψ)/2⌉, n)⁸.

Before presenting the proof of Lemma 1, we would like to highlight a subtlety in its statement. The result guarantees that for valid path environments Π, the encoding enc(Π) is accepted by A_ψ if and only if Π satisfies ψ. It says nothing about path environments that are not valid. In particular, there may be functions that map path variables to traces that do not correspond to actual paths of ⟦P⟧, but which are nonetheless accepted by A_ψ. Notice, however, that when ψ = ∃π. ψ₁ is a cognate sentence, a string over {call, int, ret} will, by the conditions guaranteed in Lemma 1, be accepted if and only if it corresponds to a stack access pattern of a path from the initial state that satisfies ∃π. ψ₁.

Proof (Sketch of Lemma 1). Our construction of A_ψ will proceed inductively. The type of automaton constructed will be consistent with the parity of fc(ψ), i.e., a NVPA if fc(ψ) is odd and a 1-AJA if fc(ψ) is even. We sketch the main ideas here, with the full proof in [2].

For a_π, ¬ψ₁, ψ₁ ∨ ψ₂, and Xψ₁, the construction essentially proceeds by converting A_{ψᵢ} (i ∈ {1, 2}), if needed, into the type (NVPA or 1-AJA) of the target automaton using Theorem 1, and then using standard closure properties to combine them into the desired automaton. In the case of ψ = ψ₁ U ψ₂, we first convert (if needed) A_{ψᵢ} (i ∈ {1, 2}) into a 1-AJA. At each step, the automaton for ψ will choose either to run A_{ψ₂}, or to run A_{ψ₁} and restart itself. Correctness relies on the fact that our encoding of path environments satisfies Proposition 2.

The most interesting case is that of ψ = ∃π. ψ₁. We will first convert (if needed) the automaton for ψ₁ into a NVPA A₁. The automaton for ψ will essentially guess the encoding of a path that is consistent with the transitions of

⁷ When m = 0, we take π_m to be †.

⁸ When the size of the specification ψ is considered constant, the size of A_ψ is at most g_{O(1)}(⌈fc(ψ)/2⌉ − 1, n).

P, and check whether assigning the guessed path to the variable π satisfies ψ₁ by running the automaton A₁. The additional requirement is that the guessed path start at the same configuration as the current configuration of the path assigned to variable π_m, which introduces some subtle challenges. In order to be able to guess a path, A_ψ will keep track of P's control state in its own control state, and use its stack to track P's stack operations along the guessed path. Since the stacks of all paths are synchronized, A_ψ can use its (single) stack to track both the stack of P and the stack of A₁. ⊓⊔

Using Lemma 1, we can establish the main result of this section.

Theorem 3. Given a pushdown system P = (S, Γ, s_in, Δ, L) and a sHCTL\* sentence θ, the problem of determining if P ⊨ θ is in ⋃_c DTIME(g_c(⌈fc(θ)/2⌉, n)), where n is a bound on the size of P and θ.

Proof. Recall that a sHCTL\* sentence is a Boolean combination of formulas of the form Eψ, where ψ is a cognate sentence. Results on whether P ⊨ Eψ for each such subformula can be combined to determine whether P ⊨ θ. Given this, the time to determine if P ⊨ θ is at most the time to decide if P satisfies each subformula of the form Eψ, plus O(n) (to compute the Boolean combination of these results). Next, recall that the construction in Lemma 1 ensures that for a cognate sentence of the form ∃π. ψ, L(A_{∃π. ψ}) consists exactly of the strings in {call, int, ret}^ω that encode a path environment over ∅ that satisfies ∃π. ψ.

Consider a sHCTL\* sentence Eψ. Let π be a path variable that does not appear in ψ. Based on the semantics of sHCTL\*, the following observation holds: P ⊨ Eψ if and only if for some path environment Π over ∅, P, Π, † ⊨ ∃π. ψ, which is equivalent to saying that P ⊨ Eψ if and only if L(A_{∃π. ψ}) ≠ ∅. Since fc(Eψ) = fc(∃π. ψ), and the emptiness problem for NVPAs can be decided in polynomial time (Theorem 2), the theorem follows. ⊓⊔

## 5 Lower Bound

In this section, we establish a lower bound for the problem of model checking sHCTL\* sentences against pushdown systems. Our proof establishes a hardness result for the sHLTL sub-fragment of sHCTL\*. Before presenting this lower bound, we introduce the function h_c(·, ·), another tower of exponentials, inductively defined as follows: h_c(0, n) = n, and h_c(k + 1, n) = h_c(k, n) · c^{h_c(k, n)}.

Theorem 4. Let P be a pushdown system and θ a sHLTL sentence such that the sizes of both P and θ are bounded by n and fc(θ) = 2k − 1 for some k ∈ N. The problem of checking if P ⊨ θ is DTIME(h_c(k, n))-hard, for every c ∈ N.

Proof (Sketch). We sketch the main intuitions behind the proof. To highlight the novelties of this proof, it is useful to recall how NSPACE(h_c(k − 1, n))-hardness for HyperLTL model checking is proved [5]. The idea is to reduce the language of a nondeterministic h_c(k − 1, n) space bounded machine M to the model checking problem by constructing a finite state transition system that guesses a run of M, and a HyperLTL formula that checks if the path is a valid accepting run.

To get the stricter bound of DTIME(h_c(k, n)), we use the fact that we are checking pushdown systems. The stack of the pushdown system can be used to guess a tree, as opposed to a simple trace. Therefore, we reduce from a h_c(k − 1, n) space bounded alternating Turing machine, instead of a nondeterministic machine. Since ASPACE(f(n)) = DTIME(2^{O(f(n))}) for f(n) ≥ log n, the theorem follows if the reduction succeeds.

Recall that a run of an alternating Turing machine M is a rooted, labeled tree, where vertices are labeled by configurations of M in a manner that is consistent with the transition function of M. To faithfully encode a tree as a sequence of symbols, we record the DFS traversal of the tree, making explicit the stack operations performed during such a traversal. Consider a labeled, rooted tree T with root r whose label is ℓ(r), with T₁ as the left sub-tree and T₂ as the right sub-tree. The DFS traversal of T will push ℓ(r), traverse T₁ recursively, pop ℓ(r), push ℓ(r), traverse T₂, and then pop ℓ(r). We will use such a DFS traversal to guess and encode runs of M. Popping and pushing ℓ(r) between the traversals of T₁ and T₂ may seem redundant. Why not simply do nothing between the traversals of T₁ and T₂? For T to be a valid run of M, the configuration labeling the root of T₂ must be the result of taking one step from ℓ(r). Such checks will be encoded in our sHLTL sentence, and for that to be possible, we need successive configurations of M to be consecutive in the string encoding.
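The DFS encoding described above can be sketched in a few lines of Python (trees as nested tuples and push/pop markers are our own encoding, used only for illustration):

```python
# Sketch of the DFS encoding above: push the root label, traverse the left
# subtree, pop, push again, traverse the right subtree, pop.
# A tree is (label, left, right) or None for the empty tree.
def dfs_encode(tree):
    if tree is None:
        return []
    label, left, right = tree
    return ([("push", label)] + dfs_encode(left)  + [("pop", label)]
          + [("push", label)] + dfs_encode(right) + [("pop", label)])
```

Note that the pop/push pair between the two recursive calls is exactly the "redundant" step discussed above: it places ℓ(r) adjacent to the root label of T₂ in the encoding, so a formula can compare successive configurations.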

To highlight some additional consistency checks, let us continue with our example tree T from the previous paragraph. For a string to be a correct encoding of T, it is necessary that the string pushed before the traversal of Tᵢ (i ∈ {1, 2}) be the same as the string popped after the traversal. This can be ensured by the pushdown system by actually pushing and popping those symbols. In addition, the string popped after T₁'s traversal must be the same as the string pushed before T₂'s traversal. Neither the stack nor the finite control of the pushdown system can be used to ensure this. Instead, this must be checked by the sHLTL sentence we construct. But the symbols popped for ℓ(r) appear in the reverse order of the symbols pushed, and it is challenging to perform this check in the formula. To overcome this, we push/pop the label and its reverse at the same time. This ensures that if we want to check whether a string just pushed is the same as a string that was just popped, we can check for string equality, and this check is easier to do using formulas in sHLTL. Additional checks to ensure that the tree encodes a valid accepting run are performed by the sHLTL sentence using ideas from [17]. Full details can be found in [2]. ⊓⊔

## 6 Conclusions

In this paper, we introduced a branching time temporal logic sHCTL\* that can be used to specify synchronous hyperproperties for recursive programs modeled as pushdown systems. The primary difference from the standard branching time logic HyperCTL\* for synchronous hyperproperties is that sHCTL\* considers a restricted class of hyperproperties, namely, those that relate only executions that share the same stack access pattern. We call such hyperproperties stack-aware hyperproperties. We showed that the problem of model checking pushdown systems against sHCTL\* specifications is decidable, and characterized its complexity. We also showed how this result can potentially be used to aid security verification.

## References



## **Proofs**

## Propositional Proof Skeletons<sup>⋆</sup>

Joseph E. Reeves1() , Benjamin Kiesl-Reiter<sup>2</sup> , and Marijn J. H. Heule<sup>1</sup>,<sup>2</sup>

> <sup>1</sup> Carnegie Mellon University, Pittsburgh, PA, USA {jereeves,mheule}@cs.cmu.edu <sup>2</sup> Amazon Web Services, Seattle, WA, USA benkiesl@amazon.com

Abstract. Modern SAT solvers produce proofs of unsatisfiability to justify the correctness of their results. These proofs, which are usually represented in the well-known DRAT format, can often become huge, requiring multiple gigabytes of disk storage. We present a technique for semantic proof compression that selects a subset of important clauses from a proof and stores them as a so-called proof skeleton. This proof skeleton can later be used to efficiently reconstruct a full proof by exploiting parallelism. We implemented our approach on top of the award-winning SAT solver CaDiCaL and the proof checker DRAT-trim. In an experimental evaluation, we demonstrate that we can compress proofs into skeletons that are 100 to 5,000 times smaller than the original proofs. For almost all problems, proof reconstruction using a skeleton improves the solving time on a single core, and is around five times faster when using 24 cores.

Keywords: SAT solving · proofs · compression.

## 1 Introduction

Solvers for the Boolean satisfiability problem (SAT) take as input a formula of propositional logic and decide if the formula is satisfiable. In case of satisfiability, they usually return an assignment of truth values to the variables of the formula; by plugging these truth values into the formula, users can easily convince themselves that the solver was right and that the formula is indeed satisfiable. In case of unsatisfiability, however, things are more complicated: to justify their answer, solvers need to produce an independently checkable proof that none of the exponentially many potential truth assignments makes the formula true.

In practical SAT solving, proofs of unsatisfiability are represented in the DRAT format [10], and they are often huge, requiring several gigabytes (in some cases even terabytes [12] or petabytes [11]) of disk storage. Storing proofs is thus costly, especially since users might not require access to the proofs until some time long after solving, at a point when proof verification or further analysis is desired.

© The Author(s) 2023

⋆ Supported by the U.S. National Science Foundation under grant CCF-2229099, and supported in part by a fellowship award under contract FA9550-21-F-0003 through the National Defense Science and Engineering Graduate (NDSEG) Fellowship Program, sponsored by the Air Force Research Laboratory (AFRL), the Office of Naval Research (ONR), and the Army Research Office (ARO).

S. Sankaranarayanan and N. Sharygina (Eds.): TACAS 2023, LNCS 13993, pp. 329–347, 2023. https://doi.org/10.1007/978-3-031-30823-9\_17

Up to now, the only options to deal with this problem were either to not store proofs and instead recompute them on demand (a laborious but plausible approach, considering that proof checking typically takes longer than solving) or to use compression methods to reduce proof size. However, syntactic compression techniques (such as LZMA or DEFLATE, as supported by the ZIP file format) only provide moderate levels of compression. The same can be said about existing semantic compression techniques for proofs in SAT and SMT (cf. [4, 18, 21]), which only achieve 20% compression on average.

In this paper, we present a novel approach to semantic compression that stores only a small subset of the clauses derived by a solver, called a proof skeleton. We can achieve strong compression rates with proof skeletons (around 100 to 5,000 times smaller than the original proof), while still retaining enough information to allow for a quick on-demand reconstruction of a complete proof that might differ from the original proof. This is similar to how a mathematician might put down the most important reasoning steps of a proof in a proof sketch, enabling a moderately talented reader to fill in the gaps. In our case, the gaps can even be filled independently, meaning that multiple readers can work in parallel.

We present both an online version (creating a proof skeleton during solving) and an offline version (creating a proof skeleton from a full proof) of our approach. We select the clauses that end up in a proof skeleton by relying on several heuristics, such as glue (a heuristic used internally by solvers to estimate the usefulness of clauses) for the online version and clause activity (a measure of how often a clause is used to derive new clauses) for the offline version. To reconstruct a full proof from a proof skeleton, we utilize multiple incremental SAT solvers that can run in parallel. We implemented all our algorithms on top of the award-winning SAT solver CaDiCaL [2] and the proof checker DRAT-trim [22]. In an extensive empirical evaluation, we demonstrate the feasibility of our approach, with all code and data available at https://github.com/amazon-science/unsat-proof-skeletons.

Beyond being a tool for compression, proof skeletons can also serve as a source of insight into a solver's reasoning. Getting any sort of intuition from a million-line proof is difficult; by computing a skeleton, we obtain a small set of facts—logically implied by the problem—that can give us an idea of how a solver established the unsatisfiability of a formula. This can lead to a feedback loop that improves solver performance. For example, when inspecting skeletons for some bounded-model-checking benchmarks, we observed many unit clauses and binary clauses of a certain type. From this, we hypothesized that the problems required more preprocessing, which did indeed improve performance.

Our main contributions are as follows: (1) We present a semantic approach for proof compression that selects only the most important clauses of a proof. (2) We implemented an online version and an offline version of our approach on top of the SAT solver CaDiCaL and the proof checker DRAT-trim. (3) In an extensive empirical evaluation, we demonstrate that our approach can drastically reduce proof size while still enabling efficient proof reconstruction.

The rest of this paper is structured as follows. In Section 2, we discuss background required to understand our paper and review related work. In Section 3, we outline the main idea behind our proof-compression approach. In Section 4, we show multiple ways to create proof skeletons, and in Section 5 we show how to reconstruct full proofs from skeletons. Finally, in Section 6, we present an empirical evaluation of our approach before concluding in Section 7.

## 2 Background and Related Work

The Boolean satisfiability problem (SAT) takes as input a formula of propositional logic and asks if there exists a truth assignment under which the formula evaluates to true. As is common in SAT solving, we consider propositional formulas in conjunctive normal form (CNF), which are defined as follows. A literal is either a variable x (a positive literal) or the negation x̄ of a variable x (a negative literal). The complement l̄ of a literal l is defined as l̄ = x̄ if l = x and as l̄ = x if l = x̄. For a literal l, we denote the variable of l by var(l). A clause is a finite disjunction of the form (l₁ ∨ · · · ∨ lₙ), where l₁, . . . , lₙ are literals. Clauses with only one literal are called unit clauses and clauses with two literals are called binary clauses. We denote the empty clause by ⊥. A formula is a finite conjunction of the form C₁ ∧ · · · ∧ Cₘ, where C₁, . . . , Cₘ are clauses. For example, (x ∨ ȳ) ∧ (z) ∧ (x̄ ∨ z̄) is a formula consisting of the clauses (x ∨ ȳ), (z), and (x̄ ∨ z̄).

A truth assignment (or assignment for short) is a function from a set of variables to the truth values 1 (true) and 0 (false). A literal l is satisfied by an assignment α if l is positive and α(var(l)) = 1 or if l is negative and α(var(l)) = 0. A literal l is falsified by an assignment if its complement l̄ is satisfied by the assignment. A clause C is satisfied by an assignment α if α satisfies at least one of C's literals. A formula ψ is satisfied by an assignment α if α satisfies all of ψ's clauses. A formula is satisfiable if there exists an assignment that satisfies it, otherwise it is unsatisfiable. A clause C = (l₁ ∨ · · · ∨ lₖ) is implied by a formula ψ, denoted by ψ |= C, if all satisfying assignments of ψ satisfy C, or equivalently, if ψ ∧ C̄ is unsatisfiable, where C̄ = (l̄₁) ∧ · · · ∧ (l̄ₖ). In case a formula is satisfiable, modern solvers can output a satisfying assignment; in case the formula is unsatisfiable, most solvers can output a proof of unsatisfiability.
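
To make these definitions concrete, the following is a minimal brute-force sketch in Python; the DIMACS-style integer encoding and all function names are our own illustration, not part of any solver discussed here (real solvers never enumerate assignments):

```python
from itertools import product

def satisfies(assignment, clause):
    # A clause (an iterable of DIMACS-style integer literals: v for a
    # positive literal, -v for a negative one) is satisfied if at least
    # one of its literals is true under the assignment.
    return any(assignment[abs(lit)] == (lit > 0) for lit in clause)

def is_satisfiable(formula, variables):
    # Brute-force search over all assignments: exponential, and meant
    # purely to illustrate the definitions above.
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if all(satisfies(assignment, clause) for clause in formula):
            return True
    return False

def implies(formula, clause, variables):
    # psi |= C  iff  psi AND ~C is unsatisfiable, where ~C contributes
    # one unit clause per complemented literal of C.
    return not is_satisfiable(formula + [[-lit] for lit in clause], variables)

# The example formula from the text: (x OR ~y) AND (z) AND (~x OR ~z),
# encoded with x = 1, y = 2, z = 3.
psi = [[1, -2], [3], [-1, -3]]
print(is_satisfiable(psi, [1, 2, 3]))  # True: x=0, y=0, z=1 satisfies psi
print(implies(psi, [-1], [1, 2, 3]))   # True: every model of psi sets x=0
```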

Proofs of Unsatisfiability. State-of-the-art SAT solvers produce so-called clausal proofs. Intuitively, a clausal proof is a list of clause additions and clause deletions. Formally, a clausal proof is a list of pairs ⟨s₁, C₁⟩, . . . , ⟨sₘ, Cₘ⟩, where for each i ∈ {1, . . . , m}, sᵢ ∈ {a, d} and Cᵢ is a clause. If sᵢ = a, the pair is called an addition, and if sᵢ = d, it is called a deletion. For a given input formula ψ₀, a clausal proof gives rise to accumulated formulas ψᵢ (i ∈ {1, . . . , m}) as follows:

$$\psi_i = \begin{cases} \psi_{i-1} \cup \{C_i\} & \text{if } s_i = \mathbf{a} \\ \psi_{i-1} \setminus \{C_i\} & \text{if } s_i = \mathbf{d} \end{cases}$$

The clauses of an accumulated formula ψᵢ are also called the active clauses at point i. Clause additions must preserve satisfiability, which is usually guaranteed by requiring the added clauses to fulfill some efficiently decidable syntactic criterion that itself implies satisfiability is preserved. Deletions are unrestricted and are not useful for proving unsatisfiability as they only make a formula "more satisfiable"; their main purpose is to speed up proof checking by keeping the set of active clauses small. A valid proof of unsatisfiability must end with the addition of the empty clause. As the empty clause is trivially unsatisfiable, and since all proof steps preserve satisfiability, the unsatisfiability of the original formula can then be concluded.
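
The accumulated-formula semantics can be sketched as a small replay loop; this toy only tracks the active clause set and does not validate the syntactic criterion on additions (encoding and names are ours):

```python
def replay(psi0, proof):
    # Replays a clausal proof over the input formula psi0: an 'a' step
    # adds a clause, a 'd' step removes the clause from the active set
    # (matching the set-minus semantics of the accumulated formulas).
    # Returns the list of accumulated formulas psi_1, ..., psi_m.
    active = list(psi0)
    states = []
    for step, clause in proof:
        if step == "a":
            active = active + [clause]
        else:  # step == "d": deletions are unrestricted
            active = [c for c in active if c != clause]
        states.append(list(active))
    return states

# A tiny proof: add (x), delete it again, then add the empty clause.
proof = [("a", (1,)), ("d", (1,)), ("a", ())]
states = replay([(1, 2), (-1,)], proof)
print(states[-1])  # [(1, 2), (-1,), ()]
```

A valid proof of unsatisfiability would additionally require every added clause to satisfy the proof system's criterion (e.g., RUP, introduced below) and the last addition to be the empty clause, as here.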

Clausal proof systems are distinguished by the syntactic criterion they impose on clause additions. The standard SAT solving paradigm conflict-driven clause learning (CDCL) [15, 16] adds so-called RUP (short for reverse unit propagation) clauses [20], whose definition is based on the notion of unit propagation. Unit propagation is the process of repeatedly applying the unit-clause rule to a formula until no unit clauses are left. Given a formula ψ, the unit-clause rule takes a unit clause (l) and makes its literal l true, meaning that (1) all clauses that contain l are removed from ψ, and (2) the negation l̄ of l is removed from all remaining clauses. If unit propagation produces the empty clause, we say it derived a conflict. For example, unit propagation derives a conflict on (x) ∧ (x̄ ∨ y) ∧ (x̄ ∨ ȳ), as the application of the unit-clause rule for (x) produces the formula (y) ∧ (ȳ), on which another application of the unit-clause rule, with either of (y) or (ȳ), produces the empty clause. If unit propagation derives a conflict on a formula, the formula is clearly unsatisfiable, but not vice versa.

A clause C = (l₁ ∨ · · · ∨ lₖ) is a RUP for a formula ψ if unit propagation derives a conflict on ψ ∧ C̄. If C is a RUP for ψ, it is implied by ψ since ψ ∧ C̄ is unsatisfiable; we thus sometimes write ψ ⊢₁ C to denote that C is a RUP for ψ. The clausal proof system allowing the addition of RUP clauses together with deletions is called DRUP. Solvers participating in the SAT competition must produce DRAT proofs, but since each DRUP proof is also a DRAT proof (but not vice versa) and since all state-of-the-art solvers actually produce DRUP proofs by default, we restrict this study of proof compression to DRUP proofs.
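
Unit propagation and the RUP check can be sketched in a few lines of Python; this is an illustrative reference implementation of the definitions, not the optimized watched-literal propagation used in real solvers:

```python
def propagates_to_conflict(clauses):
    # Repeatedly apply the unit-clause rule; return True iff the empty
    # clause is derived (i.e., unit propagation finds a conflict).
    clauses = [set(c) for c in clauses]
    while True:
        unit = next((c for c in clauses if len(c) == 1), None)
        if unit is None:
            return False  # no unit clauses left, no conflict found
        (lit,) = unit
        reduced = []
        for c in clauses:
            if lit in c:
                continue          # clause satisfied by lit: remove it
            c = c - {-lit}        # the complement of lit is falsified
            if not c:
                return True       # empty clause: conflict
            reduced.append(c)
        clauses = reduced

def is_rup(formula, clause):
    # C is a RUP for psi iff unit propagation derives a conflict on
    # psi AND ~C, with ~C added as one unit clause per complement.
    return propagates_to_conflict(formula + [[-lit] for lit in clause])

# The example from the text (x = 1, y = 2): propagation on
# (x) AND (~x OR y) AND (~x OR ~y) derives a conflict.
print(propagates_to_conflict([[1], [-1, 2], [-1, -2]]))  # True
# Hence the clause (~x) is a RUP for (~x OR y) AND (~x OR ~y).
print(is_rup([[-1, 2], [-1, -2]], [-1]))  # True
```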

A proof checker is an independent tool that verifies the correctness of proofs. There exist formally verified proof checkers that provide strong correctness guarantees (cf. [5, 9, 14, 19]). Because these tools are inefficient, proofs are often passed through an—efficient but unverified—intermediary proof checker (such as DRAT-trim [22]) that transforms a DRAT proof into a so-called LRAT proof [5]. The resulting LRAT proof includes additional information (called hints), which allows a formally verified checker to efficiently check the proof.

## 3 Problem Overview

We want to compress proofs into small representations that can be efficiently decompressed into full proofs. Existing techniques for SAT and SMT focus on transformations and substitutions that preserve validity to generate smaller proofs [4, 18, 21]. We achieve greater compression by storing only a so-called proof skeleton, which itself is not a valid proof.

Tools like Sledgehammer [3] automatically solve proof obligations from interactive theorem provers, filling gaps in the proof by translating lower-level reasoning into the theorem provers' logic. More recent work proposed a method for constructing proofs for complex SMT rewriting steps on demand in a postprocessing step [17]. In a similar way, we use proof skeletons to efficiently reconstruct valid proofs that can differ from the original proofs.

Suppose you solved an unsatisfiable CNF formula ψ, and out of the many facts you learned during solving, there were three facts A, B, and C, which you deem particularly important for showing the unsatisfiability of ψ. You can then build a proof skeleton from A, B, and C. Later, you can rephrase the question ψ |= ⊥ ("does ψ imply the empty clause?", or equivalently, "is ψ unsatisfiable?") into the following questions:

$$\psi \vdash A \qquad \qquad \psi \land A \vdash B \qquad \qquad \psi \land A \land B \vdash C \qquad \qquad \psi \land A \land B \land C \vdash \bot$$

Not only do A, B, and C provide a way to partition the proof effort; when ordered carefully, they can also be used as assumptions in subsequent questions. Each question can be submitted to a solver independently, and combining the four resulting proofs will give a proof of the original claim that ψ is unsatisfiable.
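
The question-splitting schema can be sketched as follows; the brute-force `unsat` check stands in for a real SAT solver, and the toy formula and skeleton facts are hypothetical choices, not solver output:

```python
from itertools import product

def unsat(clauses, n_vars):
    # Brute-force unsatisfiability check over variables 1..n_vars
    # (illustration only; a real implementation would call a SAT solver).
    return not any(
        all(any(assign[abs(l) - 1] == (l > 0) for l in c) for c in clauses)
        for assign in product([False, True], repeat=n_vars)
    )

def check_questions(psi, skeleton, n_vars):
    # Rephrase psi |= bottom as a chain of questions: each fact C_i must
    # follow from psi plus the previously established facts, and the
    # final question (the empty clause []) derives unsatisfiability.
    # Each question could be handed to an independent solver.
    established = list(psi)
    for c in skeleton + [[]]:
        question = established + [[-l] for l in c]  # assume ~C_i
        if not unsat(question, n_vars):
            return False
        established.append(c)  # C_i becomes an assumption later
    return True

# Toy unsatisfiable formula over x1, x2; skeleton facts (~x1) and (x2).
psi = [[1, 2], [-1, 2], [1, -2], [-1, -2]]
print(check_questions(psi, [[-1], [2]], 2))  # True
```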

Our work translates this general schema to the realm of SAT by (1) determining which learned clauses from a SAT solver are most useful and should be stored in a proof skeleton; (2) carefully grouping solver calls to prevent repeated work when producing partial proofs from a proof skeleton; and (3) stitching the partial proofs together to generate a complete proof.

Determining which clauses are stored in a proof skeleton. We co-opt the clause-importance metrics used by CDCL solvers. We give a brief overview of these metrics in the following. CDCL solvers make progress by continuously learning new clauses that help them prune the search space of possible truth assignments. To limit memory usage, they occasionally perform a clause-database reduction, removing a large portion of learned clauses based on some usefulness heuristics. Most solvers keep clauses that are short, have a low glue value, are reason clauses, or have been used recently. The glue of a clause (also known as its literal block distance, or LBD) is a positive integer that estimates the usefulness of a clause. Intuitively, a low glue value means that few decisions are required to falsify the clause, which is considered good. For a more extensive discussion of glue, we refer to the respective literature [1]. A reason clause is a clause that was used by the solver when performing unit propagation, meaning that the clause became a unit clause under a partial assignment. The number of times a reason clause is used during conflict analysis is considered the clause's activity.

Grouping solver calls for partial proofs. We leverage incremental SAT to construct partial proofs. An incremental SAT solver solves a problem with several related steps, with the solver retaining state (e.g., learned clauses and heuristics) between steps; it also allows solving under so-called assumptions, which are literals assumed to be true in a step. Solving a sequence of related steps incrementally is often much faster than solving them independently of each other (for more details on incremental SAT see, e.g., [6]).

Given a formula ψ and a sequence C₁, . . . , Cₙ of clauses, we want to produce a DRUP proof of ψ |= Cᵢ for each i ∈ {1, . . . , n}. We use an incremental solver to produce partial proofs, with each solving step corresponding to a clause Cᵢ. For the first step, ψ |= C₁, we pass the assumptions C̄₁ = l̄₁ ∧ · · · ∧ l̄ₖ to the incremental solver. Given the formula ψ, the solver assigns the literals in the assumptions, then runs the CDCL algorithm until it derives the empty clause. During solving, CDCL guarantees that all learned clauses are RUPs for the input formula ψ. Let ϕ₁ denote the sequence of clauses learned by the solver. Then, since unit propagation under the assumptions l̄₁ ∧ · · · ∧ l̄ₖ derived the empty clause, C₁ is by definition a RUP for ψ ∧ ϕ₁. This means that C₁ can be appended to the corresponding proof of the solver (which derives all clauses in ϕ₁) to obtain a valid DRUP derivation of C₁ from ψ.

In the next step, the clause C₂ is handled similarly, except the solver retains the learned clauses ϕ₁ ∧ C₁ when proving that C₂ is a RUP clause. This continues until all n + 1 steps corresponding to the n clauses of the proof skeleton are completed (step n + 1 corresponds to the derivation of the empty clause).
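
A toy version of this incremental chain is sketched below, where the clause database grows with each proved skeleton clause. Unlike a real solver, this sketch closes each gap by plain unit propagation (no CDCL search) and retains only the skeleton clauses themselves, not the learned clauses ϕᵢ:

```python
def propagates_to_conflict(clauses):
    # Minimal unit propagation (see Section 2): True iff the empty
    # clause is derived.
    clauses = [set(c) for c in clauses]
    while True:
        unit = next((c for c in clauses if len(c) == 1), None)
        if unit is None:
            return False
        (lit,) = unit
        nxt = []
        for c in clauses:
            if lit in c:
                continue
            c = c - {-lit}
            if not c:
                return True
            nxt.append(c)
        clauses = nxt

def derive_chain(psi, skeleton):
    # Mimics the n+1 incremental steps: each skeleton clause C_i must be
    # a RUP for psi plus the previously derived clauses, which stay in
    # the clause database; the final step derives the empty clause.
    proof, active = [], [list(c) for c in psi]
    for c in skeleton + [[]]:
        if not propagates_to_conflict(active + [[-l] for l in c]):
            raise ValueError("gap not closable by propagation in this toy")
        proof.append(("a", tuple(c)))
        active.append(list(c))
    return proof

psi = [[1, 2], [1, -2], [-1, 2], [-1, -2]]
print(derive_chain(psi, [[1], [2]]))  # [('a', (1,)), ('a', (2,)), ('a', ())]
```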

To parallelize this reasoning, we use an approach akin to divide-and-conquer techniques established in parallel SAT solving [13]. Divide-and-conquer solvers first partition a problem into multiple subproblems and then solve the subproblems in parallel. Similarly, we divide the incremental solver steps into so-called chunks, which are independent groups of subsequent solver steps. For example, we can split the solver steps into one chunk containing the first half of the steps and another chunk containing the second half. Both chunks can then be solved in parallel by two independent incremental SAT solvers.

Stitching partial proofs together. Once we have partial proofs for all n + 1 solving steps, a full proof of unsatisfiability can be constructed as the sequence of clause additions arising from ϕ₁, C₁, ϕ₂, C₂, . . . , Cₙ, ϕₙ₊₁, ⊥, where ϕᵢ is the sequence of clauses learned by the i-th solver step, as explained above. In general, clauses are added and deleted during solving, so the proof can be augmented with the deletion information contained in the proofs emitted by a solver. But we need to ensure clauses are not deleted in the proof and then implicitly reintroduced into a solver, which can occur when inprocessing techniques touch variables in the assumptions. We use variable freezing [7] to freeze all variables occurring in C₁, . . . , Cₙ; this avoids any unsound inprocessing [8] and is required to ensure correctness of the proofs.

## 4 Creating Proof Skeletons

Given a clausal proof P = ⟨s₁, C₁⟩, . . . , ⟨sₘ, Cₘ⟩, we define a proof skeleton of P to be a sequence of clauses obtained from clause additions in P. Ideally, a skeleton is small but contains enough useful clauses to guide reasoning during proof reconstruction. A proof skeleton can be constructed online, during the solver's execution, by applying a filter to clauses as they are traced to a proof. Alternatively, a proof skeleton can be constructed offline, after solving, by processing the full proof and selecting important clauses.

#### 4.1 Online Generation of Proof Skeletons

We create proof skeletons online by filtering clause additions as the solver traces them to a proof. Clauses that pass a usefulness threshold are added to the skeleton. As mentioned earlier, the filter applies usefulness heuristics from CDCL, including glue and clause activity. Additionally, at certain intervals we add reason clauses to the skeleton. We implemented the filter within the solver CaDiCaL, giving us access to these values as well as to the reason clauses (through the trail of assignments). We also enabled logging, giving every clause a unique identifier, in order to sort the skeletons. We evaluate three different configurations: Glue, Glue+Trail, and Dynamic.


The first two configurations combine low-glue clauses with either no or some reason clauses. Increasing the glue-value threshold often led to a compression of less than 1,000 times and slower reconstruction. Reason clauses are important because they are actively used by the solver, whereas for low-glue clauses this is not guaranteed (although low glue is associated with high usage in general). Clause-database reductions are sparse, so reason clauses (which are added only during these reductions) will be added infrequently. We evaluate the impact of including reason clauses in the skeletons in Section 6.3.

In the first two configurations, all clauses passing the filter are accepted into the skeleton. For some formulas, a solver will produce many low-glue clauses and the skeleton will become too large; for others, too few low-glue clauses will lead to a small skeleton. Our third configuration accounts for the differences between formulas by adjusting heuristics dynamically to meet a desired compression ratio. The heuristics are updated based on the number of clauses added to the skeleton within some number of conflicts, denoted as window_c. For a compression ratio between 500 and 1,000, and a window_c value of 5,000, we tuned the Dynamic configuration in the following way: every 5,000 conflicts, if more than 25 (window_c/200) lemmas passed the filter, the glue_d value is decreased, and if fewer than 3 lemmas (window_c/2,000) passed the filter, the glue_d value is increased. Reasons from the trail are added every 50,000 conflicts (window_c × 10).
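
A sketch of this dynamic adjustment is given below; parameter names such as `window` and `glue_limit` are ours, the real implementation lives inside CaDiCaL, and the periodic tracing of reason clauses from the trail is omitted:

```python
def dynamic_filter(learned, window=5000, glue_limit=3):
    # `learned` yields (glue, clause) pairs, one per conflict. Every
    # `window` conflicts the glue threshold is adjusted: decreased when
    # more than window/200 lemmas passed the filter, increased when
    # fewer than window/2000 passed.
    skeleton, passed = [], 0
    for i, (glue, clause) in enumerate(learned, start=1):
        if glue <= glue_limit:
            skeleton.append(clause)
            passed += 1
        if i % window == 0:
            if passed > window / 200:       # too many lemmas: tighten
                glue_limit = max(1, glue_limit - 1)
            elif passed < window / 2000:    # too few lemmas: loosen
                glue_limit += 1
            passed = 0
    return skeleton

# Hypothetical stream of 10,000 learned clauses with glues cycling 1..10.
stream = [((i % 10) + 1, (i,)) for i in range(10000)]
skeleton = dynamic_filter(stream)
print(len(skeleton))  # 2500: the threshold tightens from 3 to 2 after
                      # the first window, admitting fewer clauses
```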

For configurations using reason clauses, the unique clause IDs are used to sort the skeleton. This is necessary because reason clauses are traced during reductions, so they may initially appear in the skeleton long after they were learned by the solver. During proof reconstruction, it is important that clauses appear in the skeleton in an order that corresponds with a solver's reasoning.

We implemented additional configurations using clause activities. For this, we incremented an activity field for each clause every time it was used during conflict analysis. An evaluation of these additional configurations is beyond the scope of this paper, but data can be found in the paper's repository.

#### 4.2 Offline Generation of Proof Skeletons

We create proof skeletons offline by processing a full proof and selecting the most active clauses. Given a DRAT proof, the tool DRAT-trim uses backwards checking to generate an optimized LRAT proof and, optionally, an UNSAT core (i.e., an unsatisfiable subset of the original formula). From the LRAT proof, we can estimate a clause's activity by counting the number of times the clause appears in a hint of a clause-addition step. We then add the clauses with the highest activity to the skeleton until a target compression ratio is met. We found that for most problems a target ratio of 1,000 provided optimal reconstruction performance. We sort the skeleton by each clause's first use as a hint in the LRAT proof, signifying when a clause is actually used as opposed to when it is learned. We evaluate three configurations for offline generation:


The motivation for Offline-Opt is that some optimized LRAT proofs have significantly fewer clauses than the DRAT proofs, resulting from many unused lemmas, which suggests that stronger compression is possible.

Offline construction requires expensive post-processing with DRAT-trim. During online construction, by contrast, we can only guess the future usefulness of clauses when they are derived, by relying on heuristics such as glue; we cannot know how often a clause will actually be used. For instance, it may be that a clause has low glue (predicting high usefulness) but is learned and then never used in the rest of the proof, making it worthless in the skeleton. In contrast, when constructing a skeleton offline—after solving—we already know how often the clause was actually used in reasoning throughout the proof, and whether it was used to derive the empty clause. Also, we can use the UNSAT core instead of the original formula when reconstructing a proof for the original problem.
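
The offline activity heuristic can be sketched over a simplified view of an LRAT proof; the tuple encoding of steps is our own and glosses over the real DRAT-trim output format:

```python
from collections import Counter

def skeleton_from_lrat(steps, target_ratio):
    # `steps` is a simplified view of LRAT addition steps:
    # (step_id, clause, hint_ids) tuples. A clause's activity is the
    # number of times its id occurs in a hint.
    activity, first_use = Counter(), {}
    for step_id, _clause, hints in steps:
        for h in hints:
            activity[h] += 1
            first_use.setdefault(h, step_id)
    # Keep the most active clauses until the target ratio is met, then
    # sort them by first use as a hint, i.e., when they are actually
    # needed rather than when they were learned.
    keep = max(1, len(steps) // target_ratio)
    chosen = [cid for cid, _ in activity.most_common(keep)]
    return sorted(chosen, key=lambda cid: first_use[cid])

# Hypothetical proof fragment: clause 10 is hinted three times, 11 twice.
steps = [(12, (1,), (10,)), (13, (2,), (10, 11)),
         (14, (-1,), (11, 10)), (15, (), (13,))]
print(skeleton_from_lrat(steps, target_ratio=2))  # [10, 11]
```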

## 5 Reconstructing Proofs from Skeletons

We reconstruct proofs by filling the gaps of a proof skeleton with a SAT solver. Once we have proofs for all gaps, we stitch them together with the clauses of the skeleton to create a complete proof. We can utilize information obtained during proof reconstruction to further shrink skeletons by removing less useful clauses. Finally, we can also use a skeleton to create an optimized LRAT proof.


Fig. 1. Proof reconstruction from a proof skeleton and a formula ψ by filling in the gaps between skeleton clauses. This can be done with independent SAT calls or with an incremental SAT solver that keeps learned clauses (ϕᵢ) between steps.

#### 5.1 Filling Skeletons Using Incremental Solvers

We consider two ways of filling a proof skeleton's gaps—reconstruction and incremental reconstruction; both are illustrated in Fig. 1. Given a formula ψ and a skeleton C₁, . . . , Cₙ, reconstruction fills each gap ψ ∧ C₁ ∧ · · · ∧ Cᵢ₋₁ |= Cᵢ using independent SAT solver calls, with ψ ∧ C₁ ∧ · · · ∧ Cₙ |= ⊥ as the final call. Filling a gap for Cᵢ = (l₁ ∨ · · · ∨ lₖ) involves assuming l̄₁ ∧ · · · ∧ l̄ₖ and deriving the empty clause with proof ϕᵢ, which proves that Cᵢ is a RUP for ψ ∧ C₁ ∧ · · · ∧ Cᵢ₋₁ ∧ ϕᵢ. Each gap thus has an associated DRUP proof ϕᵢ emitted by the solver. Since RUP is a monotonic property, the clauses added in ϕᵢ will not affect the validity of ϕⱼ for i < j. However, clause deletions could make the proof ϕ₁, ⟨a, C₁⟩, ϕ₂, ⟨a, C₂⟩, . . . , ⟨a, Cₙ⟩, ϕₙ₊₁, ⟨a, ⊥⟩ incorrect. For example, if a skeleton clause C₁ is deleted in ϕ₂, then ϕ₃ (stemming from ψ ∧ C₁ ∧ C₂ |= C₃) may use C₁—a clause already deleted in the proof. The same problem could occur if formula clauses are deleted. Therefore, we must remove any deletion steps for clauses of the skeleton or of the formula from each ϕᵢ.
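
Removing these protected deletion steps can be sketched as a simple filter over a partial proof (the tuple encoding is ours):

```python
def strip_protected_deletions(partial_proof, skeleton, formula):
    # Drop deletion steps that touch skeleton or formula clauses:
    # a later partial proof phi_j may still use such a clause, so its
    # deletion would invalidate the stitched proof.
    protected = {tuple(c) for c in skeleton} | {tuple(c) for c in formula}
    return [(s, c) for (s, c) in partial_proof
            if not (s == "d" and tuple(c) in protected)]

formula = [(1, 2), (-1, 2)]
skeleton = [(2,)]
phi = [("a", (2, 3)), ("d", (2,)), ("d", (2, 3)), ("a", ())]
print(strip_protected_deletions(phi, skeleton, formula))
# [('a', (2, 3)), ('d', (2, 3)), ('a', ())] -- only the skeleton
# clause's deletion is dropped; deleting the learned clause (2, 3) is fine
```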

The second approach, incremental reconstruction, uses an incremental SAT solver, which allows the use of learned clauses when filling subsequent gaps. Specifically, we create an incremental problem with the steps assume(C̄₁), . . . , assume(C̄ₙ), assume(∅), where each step assume(C̄ᵢ), with Cᵢ = (l₁ ∨ · · · ∨ lₖ), involves assuming l̄₁ ∧ · · · ∧ l̄ₖ and deriving the empty clause. Each step produces a proof ϕᵢ, and the complete proof ϕ₁, ⟨a, C₁⟩, ϕ₂, ⟨a, C₂⟩, . . . , ⟨a, Cₙ⟩, ϕₙ₊₁, ⟨a, ⊥⟩ is correct as long as variables occurring in skeleton clauses are frozen (as described in Section 3). With this approach, we no longer need to worry about deletions of skeleton clauses or formula clauses because the solver fills each gap using the current clause database, i.e., each gap is proved without clauses formerly deleted by the solver.

To parallelize incremental reconstruction, we partition the incremental problem into several independent incremental problems, which we call chunks. We assign k clauses Cₗ, . . . , Cₗ₊ₖ₋₁ from the skeleton to each chunk, and we then use an incremental solver to compute partial proofs for each of the clauses, starting from the formula ψ ∧ C₁ ∧ · · · ∧ Cₗ₋₁. For each partial proof corresponding to a clause Cᵢ, we call the solver with the assumptions negating the clause, i.e., with assume(C̄ᵢ). Again, we must remove any deletion steps of skeleton clauses or formula clauses since they may be used in later chunks. All added clauses are then RUPs, and so the concatenation of chunk proofs is a complete proof.
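
The chunked, parallel reconstruction can be sketched as follows; the worker here merely echoes addition steps, where a real implementation would run one incremental CaDiCaL instance per chunk, seeded with all skeleton clauses preceding it:

```python
from concurrent.futures import ThreadPoolExecutor

def fill_chunk(job):
    # Hypothetical worker: a real implementation would prove each clause
    # of the chunk under its negated assumptions and emit a DRUP proof.
    index, clauses = job
    return index, [("a", c) for c in clauses]

def parallel_reconstruct(skeleton, n_chunks):
    # Evenly partition the skeleton into contiguous chunks, fill them
    # in parallel, and concatenate the partial proofs in chunk order.
    size = -(-len(skeleton) // n_chunks)  # ceiling division
    jobs = [(i, skeleton[i * size:(i + 1) * size]) for i in range(n_chunks)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        parts = dict(pool.map(fill_chunk, jobs))
    return [step for i in range(n_chunks) for step in parts[i]]

proof = parallel_reconstruct([(i,) for i in range(1, 9)], n_chunks=3)
print([c for (_, c) in proof])  # clauses (1,) ... (8,) in skeleton order
```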

Each chunk can be solved independently in parallel. The more skeleton clauses in each chunk, the more clauses the incremental solver can learn and reuse in subsequent steps. However, gaps might differ in hardness, meaning that some gaps can be filled quickly while others require a significant amount of solving time. A chunk can thus become a bottleneck during parallelization if it includes many difficult gaps. In our evaluation, we partitioned the skeleton into chunks of equal size, one for each core. For instance, on a single core, one incremental problem spanning the entire skeleton was given to a solver instance, whereas for 24 cores, the skeleton was partitioned into 24 chunks. In principle, we could partition a skeleton into more chunks than cores, but this would require an intermediary level of problem scheduling that we leave for future work.

#### 5.2 Shrinking Skeletons

The runtimes for filling each gap of a proof skeleton could provide insight into the usefulness of the skeleton clauses. For example, if the solver can quickly fill a gap, the corresponding skeleton clause may be trivially implied, and if the solver takes long, the clause may be useful since its derivation requires a lot of reasoning. Alternatively, the difference in runtime might not be explained by clause usefulness. Take, for example, the two gaps ψ |= C₂ and ψ ∧ C₂ |= C₅ from Fig. 1, and assume that the solver fills the first gap in a millisecond and the second gap in ten seconds. If the difference is a result of C₂ being trivially implied, it makes sense to remove C₂ from the skeleton; otherwise, if the difference is due to factors unrelated to usefulness, it is better to remove C₅. Based on this observation, we try to shrink a given skeleton by sorting gap-reconstruction times and removing a certain share of the slowest or fastest clauses.

Our empirical evaluation in Section 6 indicates that removing the fastest clauses is the right approach for improving compression and (sometimes) reducing reconstruction time. Even though gap runtime and clause usefulness are correlated, the correlation is not perfect. For instance, sometimes the incremental solver is able to quickly fill a gap because of learning from previous steps of the incremental problem. Even if it takes a long time to fill a gap, there is no guarantee that the corresponding skeleton clause is useful for filling future gaps. We examine in detail how shrinking skeletons affects reconstruction time.
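
The shrinking step itself can be sketched as follows; the runtimes and the share of clauses to drop are hypothetical inputs:

```python
def shrink_skeleton(skeleton, gap_runtimes, share, drop="fastest"):
    # Sort clauses by the time the solver took to fill their gap and
    # remove the given share of the fastest (or slowest) ones, keeping
    # the survivors in their original skeleton order.
    order = sorted(range(len(skeleton)), key=lambda i: gap_runtimes[i])
    k = int(len(skeleton) * share)
    dropped = set(order[:k] if drop == "fastest" else order[len(order) - k:])
    return [c for i, c in enumerate(skeleton) if i not in dropped]

skeleton = ["A", "B", "C", "D", "E"]
runtimes = [0.001, 4.2, 0.003, 9.8, 0.5]  # hypothetical seconds per gap
print(shrink_skeleton(skeleton, runtimes, 0.4))                  # ['B', 'D', 'E']
print(shrink_skeleton(skeleton, runtimes, 0.2, drop="slowest"))  # ['A', 'B', 'C', 'E']
```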

#### 5.3 Reconstructing LRAT Proofs from Skeletons

The proof reconstruction described above will produce DRAT proofs. Formally verified checkers typically require LRAT proofs, forcing a conversion via a proof checker such as DRAT-trim, which can take much longer than the original solving time. Instead, we can reconstruct DRAT proofs for each chunk, then convert the DRAT proofs to LRAT in parallel, and finally concatenate them.

We use DRAT-trim to convert chunk DRAT proofs to LRAT. This required us to modify DRAT-trim (e.g., by changing the way it performs backwards checking and how it handles unit clauses). By default, DRAT-trim starts backwards checking at the empty clause. But only the last chunk will derive the empty clause, and further, we must ensure all skeleton clauses are included in the backwards check, as they may be used in later chunks. To account for this, we mark each skeleton clause in the DRAT proof before performing the backwards check. The backwards check verifies that each marked clause is RAT (or RUP, in our case), including these clauses in the LRAT proof. When combining the chunk LRAT proofs, we map the skeleton clauses in each chunk to the index of the LRAT step where they were initially added. Finally, we remove all deletions from the LRAT proof; this will not noticeably affect proof-checking time, mainly since LRAT checkers perform unit propagation in linear time using hints. While the following evaluation focuses on DRAT proof reconstruction from skeletons, we tested our implementation of parallel LRAT proof reconstruction on 24 cores and verified several proofs with Cake-LPR [19].

## 6 Experimental Evaluation

We evaluated our approach on SAT competition 2021 Main Track benchmarks, using all (65) unsatisfiable formulas that were solved between 500 and 5,000 seconds by the solver CaDiCaL [2]. By requiring at least 500 seconds of solving time, we ensured that proofs are of reasonable size (around 1 GB) and therefore good candidates for compression. We ran experiments on an AWS EC2 m5d.metal instance, with 96 virtual CPUs and 500 GB of memory, running at most 24 parallel processes at a time. We used a timeout of 5,000 seconds for solving a problem and constructing a DRAT proof. For proof reconstruction on a single core, we used a single incremental problem spanning the entire skeleton. For proof reconstruction on 24 cores, we evenly divided the proof skeleton into 24 incremental problems (chunks) passed to 24 instances of CaDiCaL. We report real time for proof reconstruction, not including skeleton extraction.

#### 6.1 Single-Core Proof Reconstruction

Fig. 2 shows the best configurations on each formula using online skeletons (left) and offline skeletons (right), for the single-core experiments (i.e., the entire skeleton on a single core). Almost all proofs were reconstructed faster than the original solving time (below the red dotted line), and in some cases more than five times faster (below the blue dotted line). Each configuration was the best for some formulas. The Glue configuration led the online skeletons. With a single incremental problem, learned clauses from earlier incremental calls can be kept for the entire execution, meaning that clauses that occur later in large skeletons (e.g., Glue+Trail) may be trivially implied by previously learned clauses.

Fig. 2. Runtimes (in seconds) of the best online (left) and offline (right) configurations for proof reconstruction using a proof skeleton and a single core.

Fig. 3. Proof skeleton compression ratio for online (left) and offline (right) configurations.

#### 6.2 Skeleton Compression Ratio

Fig. 3 shows the sorted compression ratios (w.r.t. file size) between proof skeletons and the original DRAT proofs for each configuration, as well as the compression ratios for the configuration with the fastest reconstruction time on each formula (Best). For online configurations (left), the Dynamic skeletons have the most consistent compression ratios, with a trade-off in reconstruction times. In some cases, skeletons can have higher compression (10,000 times) without a loss in performance, witnessed by the right-hand tail of the plot.

For offline configurations (right), Offline selects 1/1,000 of the clauses from the original DRAT proof. The ratios are much greater than 1,000 because skeletons have no deletion information and the most active clauses are typically much shorter than the average clause. Offline-Opt provides around a factor 10 more compression, and these smaller skeletons provide faster reconstruction for about half of the formulas. In general, the compression is much better when using clause activity as a measure for clause importance as opposed to online heuristics (such as glue), with similar reconstruction times seen in Fig. 2.

Fig. 4. Runtimes (in seconds) for proof reconstruction of multiple online configurations with a single core (left) and 24 cores (right).

#### 6.3 Impact of Reason Clauses in Online Skeletons

Fig. 4 shows a comparison of reconstruction times between the Glue and the Glue+Trail online configurations, both on a single core (left) and on 24 cores (right). On a single core, creating skeletons with only low-glue clauses performs better than creating skeletons with low-glue clauses and reasons from the trail. On multiple cores, however, the reason clauses are beneficial for many reconstructions. This may be because, during parallel reconstruction, each individual chunk only has access to lemmas earlier in the skeleton during solving. Therefore, having more clauses in the skeleton will aid the later chunks. In contrast, for a single chunk on one core, learned clauses are kept throughout solving, and these learned clauses supplement the smaller skeletons.

#### 6.4 Impact of the UNSAT Core on Offline Skeletons

Fig. 5 shows the efect of using an UNSAT core during reconstruction for ofine skeletons on a single core (left) and on 24 cores (right). For the experiments using an UNSAT core, we remove formula clauses that are not in the UNSAT core before passing the formula to the solver during the incremental SAT call for the chunk proof. Using the UNSAT core greatly improves performance during

Fig. 5. Runtimes (in seconds) for Offline proof reconstruction with and without an UNSAT core with a single core (left) and 24 cores (right).

reconstruction on a single core. This may be because the skeleton is built from reasoning based on the UNSAT core, so focusing the solver on these specific formula clauses makes filling the gaps in the skeleton easier. The UNSAT core is useful in parallel reconstruction as well, producing the overall best configuration between online and offline skeletons. To give an idea, it takes approximately 125 KB to store an UNSAT core as a bit vector (each bit indicating whether or not a clause is part of the core) for a formula with one million clauses. For most formulas, this data would be dominated by the size of the proof skeleton.
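As a quick sanity check of the storage estimate above, the bit-vector size is simply the clause count in bits, rounded up to whole bytes. The following sketch (the function name is ours, not from the paper) reproduces the 125 KB figure for one million clauses.

```python
def core_bitvector_bytes(num_clauses: int) -> int:
    """Bytes needed to store an UNSAT core as a bit vector,
    one bit per formula clause, rounded up to whole bytes."""
    return (num_clauses + 7) // 8

# One million clauses -> 125,000 bytes = 125 KB, matching the estimate above.
assert core_bitvector_bytes(1_000_000) == 125_000
```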

#### 6.5 Skeleton Shrinking after Reconstruction

We discussed in Section 5.2 that it might make sense to shrink a skeleton by removing some amount of the fastest or of the slowest skeleton clauses. Fig. 6 shows results for reconstruction on 24 cores using the online skeleton, removing either the fastest 90% or the slowest 10% of clauses. To perform the shrinking, we performed proof reconstruction from the skeleton and measured the solve times for the incremental calls, with each call corresponding to a skeleton clause. Removing the fastest 90% has a small impact on reconstruction time, performing slower for the majority of formulas. In some cases, shrinking the skeleton even improves performance because redundant or unnecessary clauses are removed from the skeleton. Removing the slowest solved clauses causes a wider variation in reconstruction time. This might be because these clauses are important for guiding the solver during reconstruction, and sometimes they lead the solver into unprofitable search regions that waste time. This shows two things: (1) For some formulas, removing only a fraction of clauses from the skeleton can lead to a big or small improvement, and (2) skeleton clauses are mostly nontrivial and cannot be added or removed randomly without a potentially consequential impact.

Fig. 6. Runtimes (in seconds) of proof reconstruction on 24 cores after skeleton shrinking for the Dynamic online configuration, removing the fastest 90% (left) or the slowest 10% (right) of clauses from the skeleton.

Fig. 7. Left: Runtimes (in seconds) of the original solver on a single core against proof reconstruction on 24 cores with the best offline-skeleton configuration Offline+Units using UNSAT cores. Right: Runtimes (in seconds) of the parallel SAT solvers Mallob and iLingeling without proof logging against proof reconstruction with the best offline skeleton configurations using an UNSAT core, each using 24 cores.

#### 6.6 Comparison With Sequential and Parallel SAT Solvers

Alternatives to our proof reconstruction could be to compute a proof on demand by solving a formula from scratch (either with a sequential or with a parallel SAT solver) or to run a parallel incremental solver that fills the gaps of a skeleton.

The left plot of Fig. 7 shows the difference between running a sequential solver on a single core versus running our parallel proof reconstruction on 24 cores. For the majority of formulas, parallel proof reconstruction is over five times faster, and in some cases closer to ten times faster. One formula had little improvement for reconstruction (on the red dotted line). For this formula, the final chunk took around 2,000 seconds to solve, and the next slowest chunk took only 24 seconds, meaning the hardest gaps were all clustered in the final chunk. For these sorts of problems, a smaller chunk size could break up the hard gaps, therefore improving utilization across cores and reducing the reconstruction time.

To our knowledge, there exist no portfolio solvers or parallel incremental solvers that produce proofs. However, it might be possible to add proof support to solvers like Mallob (a clause-sharing portfolio solver) or iLingeling (a parallel incremental solver); we thus compare our approach to these solvers in the right plot of Fig. 7.

The comparison to Mallob suggests that some form of clause sharing between solvers that solve independent chunks may improve performance. This could be achieved with forward clause sharing, where learned clauses can only be sent to solvers running on subsequent chunks. Also, Mallob has full core utilization by running each solver until one derives the empty clause, but our proof reconstruction does not since some chunks take longer than others. With smaller chunk sizes and good scheduling, proof reconstruction could get closer to full utilization.

iLingeling, which is based on Lingeling [2], takes an incremental problem and greedily assigns steps to solver instances, terminating when one instance derives the empty clause. There is no clause sharing between solvers. We ran iLingeling using the incremental problem derived from the proof skeleton. In proof reconstruction, chunks can use skeleton clauses from previous chunks, leading to consistently better performance than iLingeling.

## 7 Conclusion

We presented a semantic approach for compressing propositional proofs by selecting important clauses that summarize the reasoning of a solver. We store these clauses in a so-called proof skeleton, from which we can reconstruct a complete proof in parallel by performing multiple incremental SAT solver calls. We implemented our approach on top of the SAT solver CaDiCaL and the proof checker DRAT-trim. In an empirical evaluation, we showed that our approach can produce skeletons that are 100 to 5,000 times smaller than the original proofs. On a single core, almost all proofs were reconstructed faster than the original solving time, and when using 24 cores, the majority of proofs were reconstructed around five times faster. This is significant since proof checking typically takes longer than solving, and since existing parallel solvers cannot produce proofs while maintaining strong performance. We observed that proof skeletons not only serve as a compression mechanism but also provide insight into a problem. In future work, we thus plan to explore the connection between skeletons, proofs, and solver performance.

## References




## Unsatisfiability Proofs for Distributed Clause-Sharing SAT Solvers

Dawn Michaelson<sup>2</sup>, Dominik Schreiber<sup>3</sup>, Marijn J. H. Heule<sup>1,4</sup>, Benjamin Kiesl-Reiter<sup>1</sup>, and Michael W. Whalen<sup>1,2</sup>

> <sup>1</sup> Amazon Web Services, Seattle, USA
> <sup>2</sup> University of Minnesota, Minneapolis, USA micha576@umn.edu
> <sup>3</sup> Karlsruhe Institute of Technology, Karlsruhe, Germany dominik.schreiber@kit.edu
> <sup>4</sup> Carnegie Mellon University, Pittsburgh, USA

Abstract. Distributed clause-sharing SAT solvers can solve problems up to one hundred times faster than sequential SAT solvers by sharing derived information among multiple sequential solvers working on the same problem. Unlike sequential solvers, however, distributed solvers have not been able to produce proofs of unsatisfiability in a scalable manner, which has limited their use in critical applications. In this paper, we present a method to produce unsatisfiability proofs for distributed SAT solvers by combining the partial proofs produced by each sequential solver into a single, linear proof. Our approach is more scalable and general than previous explorations for parallel clause-sharing solvers, allowing use on distributed solvers without shared memory. We propose a simple sequential algorithm as well as a fully distributed algorithm for proof composition. Our empirical evaluation shows that for large-scale distributed solvers (100 nodes of 16 cores each), our distributed approach allows reliable proof composition and checking with reasonable overhead. We analyze the overhead and discuss how and where future efforts may further improve performance.

Keywords: SAT solving · proofs · distributed computing.

## 1 Introduction

SAT solvers are general-purpose tools for solving complex computational problems. By encoding domain problems into propositional logic, users have successfully applied SAT solvers in various fields such as formal verification [31], automated planning [25], and mathematics [8, 16]. The list of applications has grown significantly over the years, mainly because algorithmic improvements have led to orders of magnitude improvement in the performance of the best sequential solvers (see, e.g., [21] for a comparison).

Despite all this progress, there are still many problems that cannot be solved quickly with even the best sequential solvers, pushing researchers to explore ways of parallelizing SAT solving. One approach that has worked well for specific problem instances is Cube-and-Conquer [17, 18], which can achieve near-linear speedups for thousands of cores but requires domain knowledge about how to effectively split a problem into subproblems. An alternative approach that does not require such knowledge is clause-sharing portfolio solving, which has recently led to solvers [12,28] achieving impressive speedups (10x–100x on a 100x16 core cluster) over the best sequential solvers across broad sets of benchmarks.<sup>5</sup>

Although distributed solvers are demonstrably the most powerful tools for solving hard SAT problems, there is an important caveat: unlike sequential solvers, current distributed clause-sharing solvers cannot produce proofs of unsatisfiability. While there has been foundational work in producing proofs for shared-memory clause-sharing SAT solvers [14], existing approaches are neither scalable nor general enough for large-scale distributed solvers. This is not just a theoretical problem—for four problems in the 2020 and 2021 SAT competitions, distributed solvers produced incorrect answers that were not discovered until the 2022 competition because they could not be independently verified.<sup>6</sup>

In this paper, we deal with this issue and present the first scalable approach for generating proofs for distributed SAT solvers. To construct proofs, we maintain provenance information about shared clauses in order to track how they are used in the global solving process, and we use the recently-developed LRAT proof format [9] to track dependencies among partial proofs produced by solver instances. By exploiting these dependencies, we are then able to reconstruct a single linear proof from all the partial proofs produced by the sequential solvers. We first present a simple sequential algorithm for proof reconstruction before devising a parallel algorithm that can even be implemented in a distributed way. Both algorithms produce independently-verifiable proofs in the LRAT format. We demonstrate our approaches using an LRAT-producing version of the sequential SAT solver CaDiCaL [5] to turn it into a clause-sharing solver, and then modify the distributed solver Mallob [28] to orchestrate a portfolio of such CaDiCaL instances while tracking the IDs of all shared clauses.

We conduct an evaluation of our approaches from the perspective of efficiency, benchmarking the performance of our clause-sharing portfolio solver against the winners of the cloud track, parallel track, and sequential track from the SAT Competition 2022. Adding proof support introduces several kinds of overhead for clause-sharing portfolios in terms of solving, proof reconstruction, and proof checking, which we examine in detail. We show that even with this overhead, distributed solving and proving is much faster than the best sequential approaches. We also demonstrate that our approach dramatically outperforms previous work on proof production for clause-sharing portfolios [14]. We argue that much of the overhead of our current setup can be compensated, among other measures, by improving support for LRAT in solver backends. We thus hope that our work provides an impetus for researchers to add LRAT support to other solvers.

Our main contributions are as follows:

<sup>5</sup> Cf. the SAT Competition 2022 results: https://satcompetition.github.io/2022/downloads/sc2022-detailed-results.zip

<sup>6</sup> The incorrectly scored problems were SAT\_MS\_sat\_nurikabe\_p08.pddl\_71.cnf, randomG-Mix-n18-d05.cnf, php12e12.cnf, and Cake\_9\_20.cnf.


The rest of this paper is structured as follows. In Section 2, we present the background required to understand the rest of our paper and discuss related work. In Section 3, we describe the general problem of producing proofs for distributed SAT solving and a simple algorithm for proof combination. In Section 4, we describe a much more efficient distributed version of our algorithm before discussing implementation details in Section 5. Finally, we present the results of our empirical evaluation in Section 6 and conclude with a summary and an outlook for future work in Section 7.

## 2 Background and Related Work

The Boolean satisfiability problem (SAT) asks whether a Boolean formula can be satisfied by some assignment of truth values to its variables. An overview can be found in [6]. We consider formulas in conjunctive normal form (CNF). As such, a formula F is a conjunction (logical "AND") of disjunctions (logical "OR") of literals, where a literal is a Boolean variable or its negation. For example, (a ∨ b ∨ c) ∧ (b ∨ c) ∧ (a) is a formula with variables a, b, c and three clauses. A truth assignment A maps each variable to a Boolean value (true or false). A formula F is satisfied by an assignment A if F evaluates to true under A, and F is satisfiable if such an assignment exists. Otherwise, F is called unsatisfiable.

If a formula F is found to be satisfiable, modern SAT solvers commonly output a truth assignment; users can easily evaluate F under the assignment in linear time to verify that F is indeed satisfiable. In contrast, if a formula turns out unsatisfiable, sequential SAT solvers produce an independently-checkable proof that there exists no assignment that satisfies the formula.
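The linear-time check mentioned above is straightforward to sketch. The clause encoding below (signed integers, anticipating the DIMACS conventions described next) and the example formula are our own illustration, not taken from the paper.

```python
def satisfies(clauses, assignment):
    """Check in linear time whether `assignment` (a dict mapping
    variable -> bool) satisfies a CNF given as a list of clauses,
    each clause a list of nonzero integers (DIMACS-style literals:
    a positive literal v is satisfied iff assignment[v] is True)."""
    for clause in clauses:
        if not any(assignment[abs(lit)] == (lit > 0) for lit in clause):
            return False  # this clause is falsified under the assignment
    return True

# Illustrative formula (a ∨ b ∨ ¬c) ∧ (¬b ∨ c) ∧ (¬a), with a=1, b=2, c=3:
cnf = [[1, 2, -3], [-2, 3], [-1]]
assert satisfies(cnf, {1: False, 2: True, 3: True})
assert not satisfies(cnf, {1: True, 2: False, 3: False})
```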

File Formats in Practical SAT Solving. In practical SAT solving, formulas are specified in the DIMACS format. DIMACS files feature a header of the form 'p cnf #variables #clauses' followed by a list of clauses, one clause per line. For example, the clause (x<sub>1</sub> ∨ ¬x<sub>2</sub> ∨ x<sub>3</sub>) is represented as '1 -2 3 0'. An example formula in DIMACS format is given in Figure 1.

The current standard format for proofs is DRAT [15]. DRAT files are similar to DIMACS files, with each line containing a proof statement that is either an addition or a deletion. Additions are lines that represent clauses like in the DIMACS format; they identify clauses that were derived ("learned") by the solver. Each clause addition must preserve satisfiability by adhering to the so-called


Fig. 1: DIMACS formula and corresponding proofs in DRAT and LRAT format.

RAT criterion—as the details of RAT are not essential to our paper, we refer the reader to the respective literature for more details [20]. Deletions are lines that start with a 'd', followed by a clause; they identify clauses that were deleted by the solver because they were not deemed necessary anymore. Clause deletions can only make a formula "more satisfiable", meaning that they aren't required for deriving unsatisfiability, but they drastically speed up proof checking. A valid DRAT proof of unsatisfiability ends with the derivation of the empty clause. As the empty clause is trivially unsatisfiable (and since each proof step preserves satisfiability), the unsatisfiability of the original formula can then be concluded. An example DRAT proof is given in Figure 1.

The more recent LRAT proof format [9] augments each clause-addition step with so-called hints, which identify the clauses that were required to derive the current clause. This makes proof checking more efficient, and in fact the usual pipeline for trusted proof checking is to first use an efficient but unverified tool (like DRAT-trim [15]) to transform a DRAT proof into an LRAT proof, and then check the resulting LRAT proof with a formally verified proof checker (cf. [9, 13, 22, 30]). Figure 1 shows an LRAT proof corresponding to a DRAT proof. Each proof line starts with a clause ID. The numbering starts with 9 because the eight clauses of the original formula are assigned the IDs 1 to 8. Each clause addition first lists the literals of the clause, then a terminating 0, followed by hints (in the form of clause IDs), and finally another 0. For example, clause 9 contains the literal -3 and can be derived from the clauses 4 and 5 of the original formula. Clause deletions just state the clause ID of the clause that is to be deleted, as in the later deletion of clause 9. In our work, we exploit the hints of LRAT to determine dependencies among distributed solvers.
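To make the line layout concrete, here is a minimal sketch of an LRAT line parser following the description above; the function name and tuple representation are our own, and a production parser would need error handling and support for the format's edge cases.

```python
def parse_lrat_line(line):
    """Parse one textual LRAT proof line into ('delete', ids) or
    ('add', clause_id, literals, hints). Additions have the shape
    '<id> <literals> 0 <hints> 0'; deletions contain a 'd' token."""
    tokens = line.split()
    if 'd' in tokens:
        # deletion line: '<id> d <clause ids> 0'
        ids = [int(t) for t in tokens[tokens.index('d') + 1:-1]]
        return ('delete', ids)
    nums = [int(t) for t in tokens]
    clause_id = nums[0]
    sep = nums.index(0, 1)                 # first 0 terminates the literals
    literals, hints = nums[1:sep], nums[sep + 1:-1]
    return ('add', clause_id, literals, hints)

# Clause 9 from the example above: literal -3, derived from clauses 4 and 5.
assert parse_lrat_line("9 -3 0 4 5 0") == ('add', 9, [-3], [4, 5])
# A deletion of clause 9 (the deleting line's own ID is hypothetical):
assert parse_lrat_line("10 d 9 0") == ('delete', [9])
```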

Parallel and Distributed SAT Solving. One way to parallelize SAT solving is to run a portfolio of sequential solvers in parallel and to consider a problem solved as soon as one of the solvers finishes (cf. [1, 4, 5, 11, 12, 18, 23, 29, 32]). Given that the solvers are sufficiently diverse, portfolio solving is already effective if all of the sequential solvers work independently, but performance and scalability can be boosted significantly by having the solvers share information in the form of learned clauses [4, 12]. This approach is taken by the distributed solver Mallob [28], which won the cloud track of the last three SAT competitions [2,3,27]. As opposed to other solvers, Mallob relies on a communication-efficient aggregation strategy to collect the globally most useful learned clauses and to reliably filter duplicates as well as previously shared clauses [27]. With this strategy, which aims to maximize the density and utility of the communicated data, Mallob scored first place in all four eligible subtracks for unsatisfiable problems at the 2022 SAT Competition.

As we discuss in more detail later, the drawback of clause sharing is that a local proof written by an individual solver may contain clauses whose derivations cannot be justified because they rely on clauses imported from another solver. Previous work focuses on writing DRAT proofs for clause-sharing parallel solvers [14]. In that work, solvers write to the same shared proof as they learn clauses. However, since the clauses are shared, one solver deleting a clause could invalidate a later clause-addition by another solver that is still holding the clause. To handle this, the parallel solver moderates deletion statements, only writing them to the proof once all solvers have deleted a clause, which leads to poor scalability during proof search. In our approach, solvers write proof files fully independently—only when the unsatisfiability of the problem has been determined do we combine all proofs into a single valid proof.

Other recent work includes reconstructing proofs from divide-and-conquer solvers [24] and from a particular shared-memory parallel solver [10], whereas we aim to exploit distributed portfolio solving.

## 3 Basic Proof Production

Our goal is to produce checkable unsatisfiability proofs for problems solved by distributed clause-sharing SAT solvers. We propose to reuse the work done on proofs for sequential solvers by having each solver produce a partial proof containing the clauses it learned. These partial proofs are invalid in general because each sequential solver can rely on clauses shared by other solvers when learning new clauses. For example, when solver A derives a new clause, it might rely on clauses from solvers B and C, which in turn relied on clauses from solvers D and E, and so on. The justification of A's clause derivation is thus spread across multiple partial proofs. We need to combine the partial proofs into a single valid proof in which the clauses are in dependency order, meaning that each clause can be derived from previous clauses.

To generate an efficiently-checkable combined proof in a scalable way, we must solve three challenges:

1. Provide metadata to identify which solver produced each learned clause.

2. Efficiently sort learned clauses in dependency order across all solvers.

3. Reduce proof size by removing unnecessary clauses.

Switching from DRAT to the LRAT proof format provides the mechanism to address all three challenges. First, we specialize the clause-numbering scheme used by LRAT in order to distinguish the clauses produced by each solver. Second, we use the dependency information from LRAT to construct a complete proof from the partial proofs produced by each solver. Finally, we determine which clauses are unnecessary (or used only for certain parts of the proof) so that we can delete clauses from the proof as soon as they are no longer required.


Algorithm 1 Algorithm for combining partial proofs

We update the clause-distribution mechanism in the distributed solver to broadcast the clause ID with each learned clause. A receiving solver stores the clause with its ID and uses the ID in proof hints when the clause is used locally, as it does with locally-derived clauses. Unlike locally-derived clauses, we add no derivation lines for remote clauses to the local proof. Instead, these derivations will be added to the final proof when combining the partial proofs.

#### 3.1 Solver Partial Proof Production

To combine the partial proofs into a complete proof, we modify the mechanism producing LRAT proofs in each of the component solvers. We assign to each clause an ID that is unique across solvers and identifies which solver originally derived it. The following mapping from clauses to IDs achieves this:

Definition 1. Let o be the number of clauses in the original formula and let n be the number of sequential solvers. Then, the ID of the k-th derived clause (k ≥ 0) of solver i is defined as ID<sub>k</sub><sup>i</sup> = o + i + nk.

Given ID<sub>k</sub><sup>i</sup>, we can easily determine the solver ID i using modular arithmetic.
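Definition 1 and its modular-arithmetic inverse can be sketched as follows; we assume solvers are numbered 1..n, which matches the IDs used later in Example 1 (o = 8 original clauses, n = 2 solvers).

```python
def clause_id(o, n, i, k):
    """ID of the k-th derived clause (k >= 0) of solver i (1..n),
    for a formula with o original clauses: ID = o + i + n*k."""
    return o + i + n * k

def solver_of(o, n, cid):
    """Recover the originating solver of a derived clause ID via
    modular arithmetic (original clauses have IDs 1..o)."""
    return (cid - o - 1) % n + 1

# With o = 8 and n = 2, solver 1 derives IDs 9, 11, 13, ...
# and solver 2 derives IDs 10, 12, 14, ...
assert [clause_id(8, 2, 1, k) for k in range(3)] == [9, 11, 13]
assert [clause_id(8, 2, 2, k) for k in range(3)] == [10, 12, 14]
assert solver_of(8, 2, 12) == 2
```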

#### 3.2 Partial Proof Combination

Once the distributed solver has concluded the input formula is unsatisfiable, we have n partial proofs. The clause derivations in these proofs refer to clauses of other partial proofs, but they are, locally, in dependency order. We can therefore combine the partial proofs without reordering their clauses beforehand. We can simply interleave their clauses so the resulting proof is also in dependency order, ignoring any deletions in the partial proofs.

Our algorithm goes through the partial proofs round-robin, at each step emitting all the clauses from each file where the dependencies of the clause have


Fig. 2: Partial proofs and combined proof of unsatisfiability.

already been emitted. It ends when the empty clause is emitted. The procedure is shown in Algorithm 1. For each partial proof, we maintain an iterator over the learned clauses. We add the next clause from the current partial proof p<sub>i</sub> to the final proof if its dependencies are satisfied (determined by comparing each hint to the last clause emitted from the partial proof from which it originated), emit the line, and move to the next clause in the file; otherwise, we cycle to the next partial proof. The algorithm terminates when it emits the empty clause (line 10).

Example 1. Suppose that two solver instances (instance 1 and instance 2) determined together that the formula from Figure 1 is unsatisfiable, with the two partial proofs shown in Figure 2. We start with instance 1. As clause 9 only relies on original clauses, we emit it. Clause 11 relies on original clause 6 and emitted clause 9, so we emit it. Clause 13 relies on clauses 8 and 12, which is not emitted, so we cannot emit clause 13 and move to instance 2. Clause 10 can be emitted, as can clause 12, which relies on an original and an emitted clause. Clause 14 relies on emitted clauses 11 and 10 and on original clause 1, so we can emit it as well. Since clause 14 is the empty clause, we finish with a complete proof, shown in Figure 2(c). Notice that clause 13 was not added to the combined proof, since it was not required to satisfy any dependencies of the empty clause.
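A minimal sketch of the round-robin combination, representing each proof line as (clause ID, literals, hints). The hints for clauses 9, 11, 13, and 14 follow Example 1; the literals and the hints of clauses 10 and 12 are illustrative assumptions consistent with the example's narrative.

```python
def combine(partial_proofs, num_original):
    """Round-robin partial-proof combination (sketch of Algorithm 1).
    Each partial proof is a list of (clause_id, literals, hints) in
    local dependency order. A clause is emitted once every derived
    hint (ID > num_original) has already been emitted; the procedure
    stops when the empty clause is emitted."""
    emitted, combined = set(), []
    pos = [0] * len(partial_proofs)
    while True:
        progress = False
        for p, proof in enumerate(partial_proofs):
            while pos[p] < len(proof):
                cid, lits, hints = proof[pos[p]]
                if any(h > num_original and h not in emitted for h in hints):
                    break  # unmet dependency: cycle to the next partial proof
                emitted.add(cid)
                combined.append((cid, lits, hints))
                pos[p] += 1
                progress = True
                if not lits:          # empty clause emitted: proof complete
                    return combined
        if not progress:
            raise ValueError("partial proofs do not close the proof")

# The two partial proofs of Example 1 (8 original clauses):
p1 = [(9, [-3], [4, 5]), (11, [-1], [6, 9]), (13, [2], [8, 12])]
p2 = [(10, [3], [5, 7]), (12, [1], [2, 10]), (14, [], [11, 10, 1])]
ids = [cid for cid, _, _ in combine([p1, p2], num_original=8)]
assert ids == [9, 11, 10, 12, 14]     # clause 13 is never emitted
```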

#### 3.3 Proof Pruning

The combined proof produced by our procedure is valid but not efficiently checkable because (1) it can contain clauses that are not required to derive the empty clause and (2) it does not contain deletion lines, meaning that a proof checker must maintain all learned clauses in memory throughout the checking process. To reduce size and to improve proof-checking performance, we prune our combined proof toward a minimal proof containing only necessary clauses, and we add deletion statements for clauses as soon as they are not needed anymore.

Algorithm 2 shows our pruning algorithm that walks the combined proof in reverse (similar to backward checking of DRAT proofs [19]). We maintain a set of clauses required in the proof, initialized to the empty clause alone. We then process all clauses in reverse order, including the empty clause, ignoring all clauses not in the required set. For each required clause, we check its dependencies to see if this is the first time (from the proof's end) a dependency is seen; if so, we emit a deletion line for the dependency since it will never be used again in the proof. After checking all its dependencies, we output the clause itself. The



final output of the algorithm is a proof in reversed order, where each clause is required for some derivation and deleted as soon as it is no longer required.

Example 2. Consider the combined proof from Figure 2. After applying Algorithm 2, working backward from clause 14, we determine that clause 12 is not required, so it is removed. Additionally, prior to clause 11, clause 9 is not in the required set, so it can be deleted after processing clause 11. On larger proofs, as discussed in Section 6, pruning can reduce the size of the proof by 10x or more.
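The backward pass can be sketched as follows, using a (clause ID, literals, hints) representation; the hints for clauses 10 and 12 in the check are illustrative assumptions, while the rest match the examples. The output is in reversed order, so in the final (forward) proof each emitted deletion appears right after the last addition that uses the deleted clause.

```python
def prune(combined, num_original):
    """Backward pruning (sketch of Algorithm 2). Walks the combined
    proof in reverse, keeps only clauses needed to derive the empty
    clause, and emits ('delete', id) the first time (from the end)
    a derived dependency is seen, i.e. at its last use."""
    empty_id = combined[-1][0]           # last line derives the empty clause
    required = {empty_id}
    out = []                             # pruned proof, in reversed order
    for cid, lits, hints in reversed(combined):
        if cid not in required:
            continue                     # unnecessary clause: drop it
        for h in hints:
            if h > num_original and h not in required:
                required.add(h)
                out.append(('delete', h))  # last use seen; safe to delete
        out.append(('add', cid, lits, hints))
    return out

# Pruning the combined proof of Examples 1 and 2 (8 original clauses):
combined = [(9, [-3], [4, 5]), (11, [-1], [6, 9]), (10, [3], [5, 7]),
            (12, [1], [2, 10]), (14, [], [11, 10, 1])]
steps = prune(combined, num_original=8)
assert [s[1] for s in steps if s[0] == 'add'] == [14, 10, 11, 9]
assert ('delete', 9) in steps            # clause 9 deleted after clause 11
assert all(s[1] != 12 for s in steps)    # clause 12 is pruned away
```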

## 4 Distributed Proof Production

The proof production as described above is sequential and may process huge amounts of data, all of which needs to be accessible from the machine that executes the procedure. In addition, maintaining the required clause IDs during the procedure may require a prohibitive amount of memory for large proofs. In the following, we propose an efficient distributed approach to proof production.

#### 4.1 Overview

Our previous sequential proof-combination algorithm first combines all partial proofs into a single proof and then prunes unneeded proof lines. In contrast, our distributed algorithm first prunes all partial proofs in parallel and only then merges them into a single file.

We have m processes with c solver instances each, amounting to a total of n = mc solvers. We make use of the fact that the solvers exchange clauses in periodic intervals (one second by default). We refer to these intervals between subsequent sharing operations as epochs. Consider Fig. 3 (left): Clause 118 was produced by S<sub>2</sub> in epoch 1. Its derivation may depend on local clause 114 and on any of the 11 clauses produced in epoch 0, but it cannot depend, e.g., on clause 109 or 111 since these clauses have been produced after the last clause sharing. More generally, a clause c produced by instance i during epoch e can only depend on (i) earlier clauses by instance i produced during epoch e or earlier, and (ii) clauses by instances j ≠ i produced before epoch e.

Fig. 3: Four solvers work on a formula with 99 original clauses, produce new clauses (depicted by their ID), and share clauses periodically, without (left) and with (right) aligning clause IDs.

Using this knowledge, we can essentially rewind the solving procedure. Each process reads its partial proofs in reverse order, outputs each line which adds a required clause, and adds the hints of each such clause to the required clauses. Required remote clauses produced in epoch e are transferred to their process of origin before any proof lines from epoch e are read. As such, whenever a process reads a proof line, it knows whether the clause is required. The outputs of all processes can be merged into a single valid proof (Section 4.3).

#### 4.2 Distributed Pruning

Clause ID Alignment. To synchronize the reading and redistribution of clause IDs in our distributed pruning, we need a way to decide from which epoch a remote clause ID originates. However, solvers generally produce clauses at different speeds, so the IDs of different solvers will likely be in dissimilar ranges within the same epoch over time. For instance, in Fig. 3 (left) instance S<sub>3</sub> has no way of knowing from which epoch clause 118 originates. To solve this issue, we propose to align all produced clause IDs after each sharing. During the solving procedure, we add a certain offset δ<sub>i</sub><sup>e</sup> to each ID produced by instance i in epoch e. As such, we can associate each epoch e with a global interval [A<sub>e</sub>, A<sub>e+1</sub>) that contains all clause IDs produced in that epoch. In Fig. 3 (right), A<sub>0</sub> = 100, A<sub>1</sub> = 116, and A<sub>2</sub> = 128. Clause 118 on the left has been aligned to 122 on the right (δ<sub>2</sub><sup>1</sup> = 4) and due to A<sub>1</sub> ≤ 122 < A<sub>2</sub> all instances know that this clause originates from epoch 1.

Initially, δ<sub>i</sub><sup>0</sup> := 0 for all i. Let I<sub>i</sub><sup>e</sup> be the first original (unaligned) ID produced by instance i in epoch e. With the sharing that initiates epoch e > 0, we compute the common start of epoch e, A<sub>e</sub> := max<sub>i</sub> {I<sub>i</sub><sup>e</sup> + δ<sub>i</sub><sup>e−1</sup> − i}, as the lowest possible value that is larger than all clause IDs from epoch e−1. We then compute offsets δ<sub>i</sub><sup>e</sup> in such a way that I<sub>i</sub><sup>e</sup> + δ<sub>i</sub><sup>e</sup> = A<sub>e</sub> + i, which yields δ<sub>i</sub><sup>e</sup> := (A<sub>e</sub> + i) − I<sub>i</sub><sup>e</sup>. If we then export a clause produced during e by instance i, we add δ<sub>i</sub><sup>e</sup> to its ID, and if we import shared clauses to i, we filter any clauses produced by i itself. Note that we do not modify the solvers' internal ID counters or the proofs they output. Later, when reading the partial proof of solver i at epoch e, we need to add δ<sub>i</sub><sup>e</sup> to each ID originating from i. All other clause IDs are already aligned.
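The offset computation can be sketched directly from the formulas above. The unaligned first IDs in the check below are our own invented numbers (0-based Python lists stand in for instances 1..n); the epoch boundaries in the lookup are those given for Fig. 3 (right).

```python
import bisect

def align_epoch(first_ids, prev_offsets):
    """One ID-alignment step at the sharing that starts epoch e.
    first_ids[i-1] is I_i^e (first unaligned ID of instance i),
    prev_offsets[i-1] is delta_i^{e-1}. Returns (A_e, offsets) with
        A_e       = max_i { I_i^e + delta_i^{e-1} - i }
        delta_i^e = (A_e + i) - I_i^e,
    so instance i's first aligned ID in epoch e is exactly A_e + i."""
    n = len(first_ids)
    A_e = max(first_ids[i] + prev_offsets[i] - (i + 1) for i in range(n))
    offsets = [(A_e + (i + 1)) - first_ids[i] for i in range(n)]
    return A_e, offsets

def epoch_of(aligned_id, epoch_starts):
    """Epoch e with A_e <= aligned_id < A_{e+1}, given the sorted
    list of epoch starts [A_0, A_1, ...]."""
    return bisect.bisect_right(epoch_starts, aligned_id) - 1

# Invented unaligned first IDs for n = 4 instances, all previous offsets 0:
A_e, offs = align_epoch([112, 105, 118, 108], [0, 0, 0, 0])
assert A_e == 115 and offs == [4, 12, 0, 11]
# The aligned first IDs A_e + 1, ..., A_e + n are consecutive:
assert [f + d for f, d in zip([112, 105, 118, 108], offs)] == [116, 117, 118, 119]
# With A_0 = 100, A_1 = 116, A_2 = 128 as in Fig. 3 (right),
# aligned clause 122 is recognized as originating from epoch 1:
assert epoch_of(122, [100, 116, 128]) == 1
```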

Rewinding the Solve Procedure. Assume that instance u ∈ {1, . . . , n} has derived the empty clause in epoch ê. For each local solver i, each process has a frontier F<sub>i</sub> of required clauses produced by i. In addition, each process has a backlog B of remote required clauses. B and F<sub>i</sub> are collections of clause IDs and can be thought of as maximum-first priority queues. Initially, F<sub>u</sub> contains the ID of the empty clause while all other frontiers and backlogs are empty. Iteration x ≥ 0 of our algorithm processes epoch ê − x and features two stages:

1. Processing: Each process continues to read its partial proofs in reverse order from the last introduced clause of the current epoch. If a line from solver i is read whose clause ID is at the top of F<sub>i</sub>, then the ID is removed from F<sub>i</sub>, the line is output, and each clause ID hint h in the line is treated as follows:

– h is inserted in F<sub>j</sub> if local solver j (possibly j = i) produced h.

– h is inserted in B if a remote solver produced h.

– h is dropped if h is an ID of an original clause of the problem.

Reading stops as soon as a line's ID precedes epoch e = ê − x. Each F<sub>i</sub> as well as B now only contain clauses produced before e.

2. Task redistribution: Each process extracts all clause IDs from B that were produced during epoch ê − x − 1. These clause IDs are aggregated among all processes, eliminating duplicates in the same manner as Mallob's clause sharing detects duplicate clauses [28]. Each process traverses the aggregated clause IDs, and each clause produced by a local solver i is added to F<sub>i</sub>.

Our algorithm stops in iteration ê after the Processing stage, at which point all frontiers and backlogs are empty and all relevant proof lines have been output.
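For intuition, the single-process special case of this backward trace (all solvers local, so no backlog or task redistribution, and with an explicit visited set in place of the epoch-wise streaming) can be sketched as:

```python
import heapq

def prune(lines, empty_clause_id, n_orig):
    """Collect the proof lines actually required, in reverse chronological order.

    lines           -- {clause_id: list of hint IDs} for every derived clause
    empty_clause_id -- ID of the derived empty clause
    n_orig          -- IDs <= n_orig denote original problem clauses (no line)
    """
    frontier = [-empty_clause_id]      # maximum-first queue via negated IDs
    seen = {empty_clause_id}
    required = []
    while frontier:
        cid = -heapq.heappop(frontier)
        required.append(cid)           # "output" this line
        for h in lines[cid]:           # hints always point to earlier clauses
            if h > n_orig and h not in seen:
                seen.add(h)
                heapq.heappush(frontier, -h)
    return required
```

With original clauses 1–2 and derived clauses 3–6, tracing from an empty clause 5 whose derivation never touches clause 6 outputs only lines 5, 4, 3, i.e., clause 6 is pruned.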

Analysis. In terms of total work performed, all partial proofs are read completely. For each required clause we may perform an insertion into some B, a deletion from said B, an insertion into some F<sub>i</sub>, and a deletion from said F<sub>i</sub>. If we assume logarithmic work for each insertion and deletion, the work for these operations is linear in the combined size of all partial proofs and log-linear in the size of the output proof. In addition, we have ê iterations of communication whose overall volume is bounded by the communication done during solving. In fact, since only a subset of shared clauses are required and we only share 64 bits per clause, we expect strictly less communication than during solving. Computing A<sup>e</sup> for each epoch e during solving is negligible since the necessary aggregation and broadcast can be integrated into an existing collective operation. Regarding memory usage, the size of each B and each F<sub>i</sub> can be proportional to the combined size of all required lines of the corresponding partial proofs. However, we can make use of external data structures which keep their content on disk except for a few buffers.

#### 4.3 Merging Step

For each partial proof processed during the pruning step, we have a stream of proof lines sorted in reverse chronological order, i.e., starting with the highest clause ID. The remaining task is to merge all these lines into a single, sorted proof file. As shown in Fig. 4 (left), we arrange all processes in a tree. We can easily merge a number of sorted input streams into a single sorted output stream

Fig. 4: Left: Proof merging with seven processes and 14 solvers. Each box represents a process with two local proof sources. Dashed arrows denote communication. Right: Example of merging three streams of LRAT lines into a single stream. Each number i represents an LRAT line describing a clause of ID i.

by repeatedly outputting the line with the highest ID among all inputs (Fig. 4, right). This way, we can hierarchically merge all streams along the tree. At the tree's root, the output stream is directed into a file. This is a sequential I/O task that limits the speed of merging. Finally, since the produced file is in reverse order, a buffered operation reverses the file's content.
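A minimal sketch of this merge of descending-ID streams (using Python's `heapq.merge` with a negated key; names are illustrative):

```python
import heapq

def merge_desc(*streams):
    """Merge streams of (clause_id, line) pairs, each sorted by decreasing ID,
    into one stream sorted by decreasing ID."""
    # heapq.merge produces an ascending merge; negating the key flips it.
    yield from heapq.merge(*streams, key=lambda pair: -pair[0])
```

An inner tree node would feed such a merged stream upward to its parent; only the root writes the result to the file.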

A final challenge is to add clause deletions to the final proof. Before a line is written to the combined proof file, we can scan its hints and output a deletion line for each hint we did not encounter before (see Section 3.3). However, implementing this in an exact manner requires maintaining a set of clause IDs which scales with the final proof size. Since our proof remains valid even if we omit some clause deletions, we can use an approximate membership query (AMQ) structure with fixed size and a small false positive rate, e.g., a Bloom filter [7].
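The deletion pass could look as follows (a sketch; a tiny Bloom filter built on Python's hashlib stands in for the actual AMQ structure, and all names are ours):

```python
import hashlib

class Bloom:
    """Fixed-size approximate membership set (Bloom filter)."""
    def __init__(self, bits=1 << 20, k=4):
        self.bits, self.k, self.arr = bits, k, bytearray(bits // 8)

    def _positions(self, item):
        digest = hashlib.blake2b(str(item).encode()).digest()
        for j in range(self.k):
            yield int.from_bytes(digest[4 * j : 4 * j + 4], "big") % self.bits

    def add(self, item):
        for pos in self._positions(item):
            self.arr[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.arr[p // 8] >> (p % 8) & 1 for p in self._positions(item))

def deletions_for(hints, seen):
    """Deletion lines to emit before writing a proof line with these hints.

    Scanning the reverse-ordered proof, the first time we meet a hint is that
    clause's last use going forward. A false positive merely suppresses a
    deletion, which keeps the proof valid."""
    fresh = [h for h in hints if h not in seen]
    for h in fresh:
        seen.add(h)
    return fresh
```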

## 5 Implementation

We employ a solver portfolio based on the sequential SAT solver CaDiCaL [5]. We modified CaDiCaL to output LRAT proof lines and to assign clause IDs as described in Section 3.1. To ensure sound LRAT proof logging, some features of CaDiCaL currently need to be turned off, such as bounded variable elimination, hyper-ternary resolution, and vivification. Similarly, Mallob's original portfolio of CaDiCaL configurations features several options that are incompatible with our proof logging as of yet. Therefore, we created a smaller portfolio of "safe" configurations that include shuffled variable priorities, adjusted restart intervals, and disabled inprocessing. We also use different random seeds and Mallob's diversification based on randomized initial variable polarities.

We modified Mallob to associate each clause with a 64-bit clause ID. For consistent bookkeeping of sharing epochs, we defer clause sharing until all processes have fully initialized their solvers. While several solvers may derive the empty clause simultaneously, only one of them is selected as the "winner" whose empty clause will be traced. The distributed proof production features communication similar to Mallob's clause sharing. To realize the frontier F<sub>i</sub> and the backlog B described in Section 4.2, we implemented an external-memory data structure which writes clause IDs to disk, categorized by their epoch. Upon reaching a new epoch, all clause IDs from this epoch are read from disk and inserted into an internal priority queue to allow for efficient polling and insertion. To merge the pruned partial proofs, we use point-to-point messages to query and send buffers of proof lines between processes. We interleave this merging with the pruning procedure in order to avoid writing intermediate output to disk. We use a fixed-size Bloom filter to add some deletion lines to the final proof.
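A toy model of such an epoch-categorized external-memory structure (with an in-memory dict standing in for the on-disk files; names are ours, not Mallob's):

```python
import heapq
from collections import defaultdict

class EpochIdStore:
    """Clause IDs bucketed by epoch; only the current epoch's IDs are kept
    in an in-memory priority queue (in the real system, buckets live on disk)."""
    def __init__(self):
        self._buckets = defaultdict(list)   # epoch -> clause IDs ("on disk")
        self._heap = []                     # current epoch, maximum-first

    def add(self, epoch, clause_id):
        self._buckets[epoch].append(clause_id)

    def load_epoch(self, epoch):
        """Read one epoch's IDs 'from disk' into the in-memory queue."""
        self._heap = [-cid for cid in self._buckets.pop(epoch, [])]
        heapq.heapify(self._heap)

    def poll(self):
        """Largest pending ID of the loaded epoch, or None if exhausted."""
        return -heapq.heappop(self._heap) if self._heap else None
```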

## 6 Evaluation

In this section, we present an evaluation of our proof production approaches. We provide the associated software as well as a digital appendix online.<sup>7</sup>

#### 6.1 Experimental Setup

Supporting proofs introduces several kinds of performance overhead for clause-sharing portfolios in terms of solving, proof reconstruction, and proof checking. We wish to examine how well our proof-producing solver performs against (1) best-of-breed parallel and cloud solvers that do not produce proofs, (2) previous approaches to proof-producing parallel solvers, and (3) best-of-breed sequential solvers. We analyze the overhead introduced by each phase of the process, and we discuss how and where future efforts might improve performance.

We use the following pipeline for our proof-producing solvers: First, the input formula is preprocessed by performing exhaustive unit propagation. This is necessary due to a technical limitation of our LRAT-producing modification of CaDiCaL. Second, we execute our proof-producing variant of Mallob on the preprocessed formula. Third, we prune and combine all partial proofs, using either our sequential proof production or our distributed proof production. Fourth, we merge the preprocessor's proof and our produced proof and syntactically transform the result to bring the set of clause IDs into compact shape. Fifth and finally, we run lrat-check<sup>8</sup> to check the final proof. Only steps two and three of our pipeline are parallelized (step three depending on the particular experiment). We will refer to the first two steps as solving, the third step as assembly, the fourth step as postprocessing, and the fifth step as checking.

To examine performance overhead for proof-producing parallel and distributed solvers, we compare our proof-producing cloud and parallel solvers (mallob-cacld-p and mallob-capar-p) against six solvers. First, we include the winners of the 2022 SAT competition cloud track (mallob-kicaliglu, using Kissat+CaDiCaL+Lingeling+Glucose), parallel track (parkissat-rs, using Kissat), and sequential track (Kissat\_MAB-HyWalk), as well as the second place

<sup>7</sup> https://github.com/domschrei/mallob/tree/certified-unsat

<sup>8</sup> https://github.com/marijnheule/drat-trim


Table 1: Overview of solved instances: (S)equential, (P)arallel, and (C)loud

solver from the parallel track (mallob-ki, using Lingeling<sup>9</sup>). We also run parallel and cloud versions of Mallob that use our described CaDiCaL portfolio without proof production (mallob-capar and mallob-cacld).

Following the SAT competition setup, each cloud solver runs on 100 m6i.4xlarge EC2 instances (16 core, 64GB RAM), each parallel solver runs on a single m6i.16xlarge EC2 instance (64 core, 256GB RAM), and the sequential Kissat\_MAB-HyWalk runs on a single m6i.4xlarge EC2 instance. For each solver, we run the full benchmark suite from the SAT-Competition 2022 (400 formulas) containing both SAT and UNSAT examples. The timeout for the solving step is 1000 seconds, and the timeout for all subsequent steps is set to 4000 seconds.

Since earlier work [14] is no longer competitive in terms of solving time, we only compare proof-checking times. Specifically, we measure the overhead of checking un-pruned DRAT proofs such as the ones produced by [14]. This way, we can get a picture of the performance of the earlier approach if it were realized with state-of-the-art solving techniques. We generate un-pruned DRAT proofs from the original (un-pruned) LRAT proof by stripping out the dependency information and adding delete lines for the last use of each clause.

#### 6.2 Results

First, we examine the performance overhead, on the solving process only, of changing portfolios to enable proof generation as described in Section 5. Fig. 5 (left) and Table 1 show this data. The PAR-2 metric takes the average time to solve each problem but counts a timeout as a 2× penalty (e.g., given our timeout of 1000 seconds, a timeout is scored as taking 2000 seconds). We can see that our CaDiCaL portfolio mallob-capar outperforms the Lingeling-based mallob-ki significantly and is almost on par with parkissat-rs. Similarly, mallob-cacld solves eight fewer instances than mallob-kicaliglu but performs almost equally well otherwise. In both cases, we have constructed solvers which are,

<sup>9</sup> mallob-ki employed a Lingeling-based portfolio due to a misconfiguration, see: http://algo2.iti.kit.edu/schreiber/downloads/mallob-ki-mallob-li.pdf

Fig. 5: Left: Comparison of solving times. Right: Relation of solving times to assembly and postprocessing times for mallob-cacld-p. Each pair of points corresponds to one instance, the y coordinate denoting the solving time. The left x coordinate denotes solving and assembly time and the right x coordinate denotes solving, assembly, and postprocessing time.

up to a small margin, on par with the state of the art. For our actual proof-producing solvers, mallob-capar-p and mallob-cacld-p, we noticed a more pronounced decline in solving performance. On top of the overhead introduced by proof logging and our preprocessing, we experienced a few technical problems, including memory issues<sup>10</sup>, which resulted in a drop in the number of instances solved and also caused mallob-capar-p with parallel proof production to solve three fewer instances than with sequential proof production. We believe that we can overcome these issues in future versions of our system. That being said, our proof-producing solvers already outperform any of the solvers at a lower scale.

Second, we examine statistics on proof reconstruction and checking, showing results in Table 2. Since we want to investigate our approaches' overhead compared to pure solving, we measure run times as a multiple of the solving time. (We provide absolute run times in the Appendix, Table 1.) The prefix "Seq." denotes mallob-capar-p with sequential proof production, "Par." denotes mallob-capar-p with distributed proof production run on a single machine, and "Cld." denotes mallob-cacld-p with distributed proof production.

DRAT checking succeeded in 81 out of 139 cases and timed out in 58 cases. For the successful cases, DRAT checking took 24.8× the solving time on average whereas our sequential assembly, postprocessing and checking combined succeeded in 139 cases and only took 3.8× the solving time on average. This result confirms that our approach successfully overcomes the major scalability problems of earlier work [14]. In terms of uncompressed proof sizes, our LRAT

<sup>10</sup> We disabled Mallob's memory panic mode to ensure consistent proof logging.


Table 2: Statistics on proof production and checking. All properties except for file sizes and pruning factor are given as a multiple of the solving time. We list minima, maxima, medians, arithmetic means, and the 10th and 90th percentiles.

proofs can be about twice as large as the DRAT proofs, which seems more than acceptable considering the dramatic difference in performance. Given that DRAT-based checking was ineffective at the scale of parallel solvers, we decided to omit it in our distributed experiments, which feature even larger proofs.

Regarding mallob-capar-p with parallel proof production, we can see that the assembly time is reduced from 2.32× down to 0.81× the solving time on average, which also improves overall performance (3.84× to 2.34×).

The results for mallob-cacld-p demonstrate that our proof assembly is feasible, taking around 2.5× the solving time on average. We visualize this overhead and how it relates to the postprocessing overhead in Fig. 5 (right). The proofs produced are about twice as large as for mallob-capar-p. Considering that the proofs originate from 25 times as many solvers, this increase in size is quite modest, which can be explained by our proof pruning. We captured the pruning factor — the number of clauses in all partial proofs divided by the number of clauses in the combined proof — for each instance. Our pruning reduces the number of derived clauses by a factor of 293.8 on average (17.8 for the median instance), showing that it is a crucial technique for obtaining proofs that are feasible to check. As such, we also managed to produce and check a proof of unsatisfiability for a formula whose unsatisfiability had not been verified before (PancakeVsInsertSort\_8\_7.cnf).

Lastly, to compare our approach at the largest scale with the state of the art in sequential solving, we computed speedups of mallob-cacld-p (solving times only) over Kissat\_MAB-HyWalk and arrived at a median speedup of 11.5 (Appendix, Table 2). We also analyzed drat-trim checking times of Kissat\_MAB-HyWalk, kindly provided by the competition organizers, and arrived at a median overhead of 1.1× its own solving time (Appendix, Table 3). Going by these measures, Kissat\_MAB-HyWalk takes around 11.5 · 2.1 ≈ 24.2× the solving time of mallob-cacld-p to arrive at a checked result, while our complete pipeline only takes 5.1× the solving time for the median instance. This indicates that our approach is considerably faster than the best available sequential solvers.

We can see that the bottleneck of our pipeline shifts from the assembly step to the postprocessing and checking steps as the degree of parallelism increases. This is to be expected since the latter steps are, so far, inherently sequential whereas our proof assembly is scalable. While the postprocessing step is a technical necessity in our current setup, we believe that large portions of it can be eliminated in the future with further engineering. For instance, enhancing the LRAT support of our modified CaDiCaL to natively handle unit clauses in the input would allow us to skip preprocessing and simplify postprocessing.

## 7 Conclusion and Future Work

Distributed clause-sharing solvers are currently the fastest tools for solving a wide range of difficult SAT problems. Nevertheless, they have previously not supported proof-generation techniques, leading to potential soundness concerns. In this paper, we have examined mechanisms to add efficient support for proof generation to clause-sharing portfolio solvers. Our results demonstrate that we can, with reasonable efficiency, extend these solvers so that users can have full confidence that the results they produce are correct.

Following this research, more work is required to reduce the overhead of the different steps involved and to improve the scalability of the end-to-end procedure. This may include designing more efficient (perhaps even parallel) LRAT checkers, examining proof-streaming techniques to eliminate most I/O operations, and improving LRAT support in solver backends. In fact, it might be possible to generalize our approach to DRAT-based solvers by adding additional metadata, which might allow easier retrofitting of the approach onto larger portfolios of solvers. We also intend to investigate producing proofs in Mallob for the case where many problems are solved at once and jobs are rescaled dynamically [26].

#### Acknowledgments

We would like to thank Mario Carneiro for providing help with his FRAT-supporting variant of CaDiCaL; Markus Iser for providing competition data on proof checking; Andrew Gacek for his suggestions on early drafts of this paper; and the reviewers for their helpful feedback. This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 882500). This project was partially supported by the U.S. National Science Foundation grant CCF-2015445.

## References


26-29, 2017, Proceedings. Lecture Notes in Computer Science, vol. 10499, pp. 269– 284. Springer (2017). https://doi.org/10.1007/978-3-319-66107-0\_18


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

#### Carcara: An Efficient Proof Checker and Elaborator for SMT Proofs in the Alethe Format<sup>⋆</sup>

Bruno Andreotti<sup>1</sup>, Hanna Lachnitt<sup>2</sup>, Haniel Barbosa<sup>1</sup> (B)

<sup>1</sup> Universidade Federal de Minas Gerais, Belo Horizonte, Brazil <sup>2</sup> Stanford University, Stanford, USA hbarbosa@dcc.ufmg.br

Abstract. Proofs from SMT solvers ensure correctness independently from implementation, which is often a requirement when solvers are used in safety-critical applications or proof assistants. Alethe is an established SMT proof format generated by the solvers veriT and cvc5, with reconstruction support in the proof assistants Isabelle/HOL and Coq. The format is close to SMT-LIB and allows both coarse- and fine-grained steps, facilitating proof production. However, it lacks a stand-alone checker, which harms its usability and hinders its adoption. Moreover, the coarse-grained steps can be too expensive to check and lead to verification failures. We present Carcara, an independent proof checker and elaborator for Alethe, implemented in Rust. It aims to increase the adoption of the format by providing push-button proof checking for Alethe proofs, focusing on efficiency and usability; and by providing elaboration of coarse-grained steps into fine-grained ones, increasing the potential success rate of checking Alethe proofs in performance-critical validators, such as proof assistants. We evaluate Carcara over a large set of Alethe proofs generated from SMT-LIB problems and show that it has good performance and that its elaboration techniques can make proofs easier to check.

## 1 Introduction

Satisfiability modulo theories (SMT) solvers are widely used as background tools in various formal-method applications, ranging from proof assistants to program verification [9]. Since these applications rely on the SMT solver results, they must trust their correctness. However, state-of-the-art SMT solvers are often found to have bugs, despite the best efforts of developers [30, 38]. One way to address this issue is to formally verify the solvers' correctness ("certifying" them), but this approach can be prohibitively expensive and time consuming, besides often requiring performance compromises [19, 20, 27, 33] and increasing the evolution cost of the systems [14]. Alternatively, solvers can produce proofs: independently checkable certificates that justify the correctness of their results. Since proof checking generally has lower complexity than solving, small and trusted checkers can verify solver results in a scalable manner. Despite the successful adoption

<sup>⋆</sup> This work was partially supported by an Amazon Research Award (Spring 2021), a gift from Amazon Web Services, and the Stanford Center for Automated Reasoning.

of this approach by several SMT solvers [7,13,15,24,37], no standard SMT proof format has emerged, with each system using its own format and independent toolchain. The Alethe<sup>1</sup> format [35] for SMT proofs, however, has been emitted by the veriT solver for several years [10] and recently<sup>2</sup> also by the cvc5 solver [7]. Moreover, Alethe proofs can be reconstructed within the proof assistants Coq [4, 16] and Isabelle/HOL [11, 36], which allows leveraging solvers that support the format (namely veriT and CVC4, the latter via a translator [16]) for automatic theorem proving. In Isabelle/HOL in particular this integration has been very successful with the veriT solver, significantly increasing the success rate of the popular Sledgehammer tactic [36]. The format has been refined and extended through the years [6], being now mature and used by multiple systems, with support for core SMT theories, quantifiers, and pre-processing. It allows different levels of granularity, so that solvers can provide coarse-grained proofs (which are easier to produce), or take the effort to produce more detailed, fine-grained proofs (which are often easier to check). It provides a term language close to SMT-LIB [8], facilitating printing from solvers as well as validating the connection between proofs and the corresponding proved problems. An overview of the Alethe proof format is given in Section 2.

A significant drawback of the Alethe format, however, is that it does not have an independent proof checker. This makes it harder for solvers to adopt the format, since to test their proof production they must be directly integrated with the proof assistants for which Alethe reconstructions are available. Moreover, these reconstruction methods do not check whether proof steps comply with the format's semantics, but rather use them as hints for internal tactics. Finally, the reconstruction techniques struggle with scalability due to well-known performance issues in the proof assistants [12, 36].

In this paper we introduce Carcara<sup>3</sup> (Section 3), an independent proof checker for Alethe proofs, implemented in a high-performance programming language, Rust. Carcara is open-source and available under the Apache 2.0 license. Proof checking (Section 3.1) is performed by a collection of modules specific to each rule being checked. The presence of coarse-grained steps in Alethe requires special handling in the checker to account for missing information, which is discussed in detail. Carcara also provides proof elaboration methods (Section 3.2) for particularly impactful coarse-grained steps, so that they can be automatically translated, offline from the solver, into easier-to-check fine-grained steps. We evaluate (Section 4) Carcara's proof checking on a large set of proofs generated by veriT from SMT-LIB problems, analyzing its performance and effectiveness. The same set of proofs is used to evaluate the proof elaboration methods, where we analyze how checking elaborated proofs compares with the

<sup>1</sup> The format was previously known as the "veriT format", but it has recently been renamed to reflect its independence from any individual solver.

<sup>2</sup> cvc5's support for Alethe is still experimental and is under active development. Carcara can actually be instrumental for improving cvc5's support for Alethe.

<sup>3</sup> We follow the bird theme of the "Alethe" name. Carcará is the Portuguese word for the crested caracara, a resourceful bird of prey native to South America.

originals. Our analysis shows that Carcara provides performant proof checking and can identify wrong proofs produced by veriT. It also shows that elaboration can in some cases generate proofs significantly easier to check than the original ones.

#### 1.1 Related work

Carcara is inspired by the highly successful DRAT-trim [23] proof checker for SAT proofs, which has been instrumental to the extensive usage of proofs in toolchains involving SAT solvers. It has also provided a basis for numerous advances in SAT proofs, with new proof formats and new checking techniques. We see its performant proof checking and elaboration techniques as the key elements of its success, serving both as an independent checker and as a bridge between solvers and performance-critical checkers, such as proof assistants or certified checkers. Providing both these features is the main goal of Carcara.

The checker for the Logical Framework with Side Conditions (LFSC) [37], an extension of Edinburgh's Logical Framework (LF) [22], written in C++, is also a stand-alone, non-certified, highly efficient proof checker. The logical framework, where new rules can be mechanized in a language understood by the checker, provides great flexibility, and LFSC has been successfully used as a proof format for CVC4 [28] and cvc5 [5]. Similarly, Dedukti [25] is an OCaml checker for the λΠ-calculus, another extension of LF, and has been applied to SMT proofs, including to Alethe<sup>4</sup>. However, we are not aware of any mature implementation for this end. Elaboration techniques have not been the focus of these tools. Another difference is that they are based on dependently-typed languages far removed from SMT-LIB, and generating proofs from SMT solvers for them can be more challenging, as can relating the resulting proofs to the original problems.

An independent checker has been proposed for SMT proofs [34] from the OpenSMT [26] solver. The checker targets problems with uninterpreted functions and linear arithmetic, but supports neither quantifiers nor pre-processing. It leverages DRAT-trim for the propositional reasoning and employs Python components for checking the other parts of the proof. Different components can use different proof formats, and to the best of our knowledge no comprehensive specification of the overall format is available. Some SMT solvers, such as SMTInterpol [24] and cvc5 [7], have internal checkers for their proofs. Since these are not independent from the solvers, they are incomparable to our approach.

## 2 The Alethe Proof Format

Alethe was originally designed [10] as a proof-assistant-friendly, easy-to-produce proof format for SMT solvers. A clear specification of the rules in a reference document [2] is provided, facilitating reconstruction within proof assistants by avoiding ambiguous syntax or semantics. To facilitate proof production, Alethe uses a term language that directly extends SMT-LIB, thus not requiring solvers

<sup>4</sup> "Verine" library available at https://deducteam.github.io/data/libraries/verine.tar.gz

to translate between different term languages when outputting proofs. More importantly, Alethe's proof calculus provides rules with varying levels of granularity, allowing coarse-grained steps and relying on powerful proof checkers for filling in the gaps. This reduces the burden on developers to track all reasoning steps performed by the solver, a notoriously difficult task [7]. The set of rules in the format captures SMT solving (as generally performed by CDCL(T)-based SMT solvers [31]) for problems containing a mix of quantifiers, uninterpreted functions, and linear arithmetic, as well as multiple pre-processing techniques. As a testament to the format's success, it has been refined and extended throughout the years [6], and has been used as the basis for the integration, with the proof assistants Isabelle/HOL and Coq, of the SMT solvers veriT [6, 36], CVC4 [16] and cvc5 [5, Sec. 3].

Here we briefly overview the Alethe proof format. For the full description of its syntax and semantics please see [2]. We assume the reader is familiar with basic notions of many-sorted equational first-order logic [17]. Alethe proofs have the form π : φ<sub>1</sub> ∧ · · · ∧ φ<sub>n</sub> → ⊥, i.e., they are refutations, where ⊥ is derived from assumptions φ<sub>1</sub>, . . . , φ<sub>n</sub> corresponding to the original SMT instance being refuted. Proofs are a series of steps represented as an indexed list of step commands. The command assume is analogous to step but used only for introducing assumptions. The indexed steps induce a directed acyclic graph rooted at the step concluding ⊥ and with the assumptions φ<sub>1</sub>, . . . , φ<sub>n</sub> as leaves. Steps represent inferences and abstractly have the form

$$c\_1, \dots, c\_k \rhd\; i.\; \psi\_1, \dots, \psi\_l \quad (\text{rule}\; p\_1, \dots, p\_n)\; [a\_1, \dots, a\_m]$$

where rule names the inference rule used in this step. Every step has an identifier i and concludes a clause, represented as a list of literals ψ<sub>1</sub>, . . . , ψ<sub>l</sub>. The premises are identifiers p<sub>1</sub>, . . . , p<sub>n</sub> of previous steps or assumptions, and rule-dependent arguments are terms a<sub>1</sub>, . . . , a<sub>m</sub>; steps may occur under a context, which is defined by bound variables or substitutions c<sub>1</sub>, . . . , c<sub>k</sub>. Contexts are introduced by the anchor command, which opens subproofs. Subproofs simulate the effect of the ⇒-introduction rule of Natural Deduction, where local assumptions are put in context and the last step in a subproof represents its conclusion and the closing of its context. Besides arbitrary formulas, Alethe supports contexts which put in scope bound variables and substitutions, which are useful for representing pre-processing techniques in the presence of binders [6], such as Skolemization, let elimination and alpha-conversion.

The structure of Alethe proofs is motivated by SMT solvers generally operating via a cooperation of a SAT solver and multiple engines that perform theory reasoning, deriving new facts and applying simplifications. The overall proof may be seen as a ground first-order resolution proof with theory lemmas justified by closed subproofs. Hence the emphasis on steps concluding clauses as term lists, which avoids ambiguity as to which clause a disjunction represents. For example, whether a resolution step concluding the term A ∨ B corresponds to the clause [A, B] or [A ∨ B] depends on the premises. The use of identifiers for steps allows representing proofs as directed acyclic graphs rather than trees. Similarly,

```
(set-logic LIA)
(assert (forall ((x Int)) (> x 0)))
(assert (not (forall ((y Int)) (> y 0))))
(check-sat)
```

```
(assume h1 (forall ((x Int)) (> x 0)))
(assume h2 (not (forall ((y Int)) (> y 0))))
(anchor :step t3 :args ((y Int) (:= x y)))
(step t3.t1 (cl (= x y)) :rule refl)
(step t3.t2 (cl (= (> x 0) (> y 0))) :rule cong :premises (t3.t1))
(step t3 (cl (= (forall ((x Int)) (> x 0)) (forall ((y Int)) (> y 0))))
  :rule bind)
(step t4 (cl (not (forall ((x Int)) (> x 0))) (forall ((y Int)) (> y 0)))
  :rule equiv1 :premises (t3))
(step t5 (cl) :rule resolution :premises (t4 h1 h2))
```
Fig. 1: A simple SMT-LIB problem and an Alethe proof of its unsatisfiability.

term sharing can be achieved via the SMT-LIB :named attribute or define-fun commands [8, Sec. 4.1.6], which both allow naming subterms. These measures are essential for a compact representation of proofs, which can be prohibitively large otherwise. Explicitly providing the conclusion of proof steps aims to facilitate both proof checking (as it allows steps to be verified locally) and proof production, since coarse-grained rules that do not uniquely define their conclusions from premises and arguments can still be effectively checked.
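Concretely, the two readings of A ∨ B are distinguished in Alethe's clause syntax (a schematic fragment, with A and B standing for arbitrary formulas and the rule names elided):

```
(step t1 (cl A B) :rule ...)       ; the two-literal clause [A, B]
(step t2 (cl (or A B)) :rule ...)  ; the unit clause [A ∨ B]
```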

Example 1. Figure 1 shows an SMT-LIB problem and an Alethe proof of its unsatisfiability. Note that in Alethe's concrete syntax clauses are represented via the cl operator (the only exception are conclusions of assume commands, which are considered unit clauses) and the context is not explicitly put in the steps, but rather assumed for all steps under the (potentially nested) anchors introducing its elements. For this proof to be valid, three conditions need to be met: each assume command must correspond to an assert command in the original problem, every step command must be valid according to the semantics of its rule, and the proof must end with a step that concludes the empty clause (cl). The proof satisfies the first condition, as the terms in the assume commands are precisely the asserted terms in the SMT problem. The third condition holds as t5, the last step, concludes the empty clause. For the second condition, step t4 is a direct consequence of the equivalence in its premise, t3, so it remains to check step t3, which is derived from a subproof. The anchor for t3 introduces a bound variable y and a substitution {x ↦ y}. The steps in the subproof contain terms with this new variable and operate under this substitution. The rule refl models reflexivity modulo the cumulative, capture-avoiding substitution in the (potentially nested) context, and thus t3.t1 holds since x and y are syntactically equal after applying the substitution {x ↦ y}. Step t3.t2 is regular congruence with the operator ">" and does not depend on the context. Finally, step t3 holds because its subproof shows the equivalence of the


Fig. 2: Overview of the architecture of Carcara.

bodies of the quantifiers under the renaming, introduced in the context, into a fresh variable relative to the left-hand side quantifier. Since all steps follow the expected semantics, all conditions are met and the proof is valid.

In the next section we show how Carcara checks the above conditions, highlighting some challenging rules and showing how some coarse-grained steps are elaborated into proofs potentially simpler to check.

## 3 Architecture and core components

Carcara is developed in the Rust programming language and is publicly available<sup>5</sup> under the Apache 2.0 license. Its architecture is shown in Figure 2. It provides both a command-line interface and bindings for a Rust API. The main component is the proof checker (6.5k LOC), a collection of checking procedures, one per rule (Section 3.1). The elaborator (1k LOC) has an interface to the cvc5 solver, as well as a collection of elaboration methods and a post-processing module to knit together the elaborated proof (Section 3.2). The remaining components total 6k LOC, including a handwritten 2k LOC SMT-LIB and Alethe parser, and an Alethe printer.

The inputs of Carcara are an SMT-LIB problem ϕ and an Alethe proof π : ϕ → ⊥. In proof-checking mode it checks each step in π with the respective procedure for its rule and prints valid when all steps are successfully checked and the proof concludes the empty clause (cl); holey when π is valid but contains steps that are not checked ("holes"); or invalid otherwise, together with an error message indicating the first step where checking failed and why. In proof-elaboration mode it converts π into a proof π′ : ϕ → ⊥, where some steps may be replaced by a series of steps elaborating them, and prints π′.

<sup>5</sup> https://github.com/ufmg-smite/carcara

#### 3.1 Checking Alethe proofs

First the original SMT-LIB problem and its Alethe proof are parsed. The problem provides the declarations of sorts and symbols that may be used in the proof, as well as the original assertions, which must match the assumptions in the proof. Symbol definitions in the proof for term sharing are expanded during parsing. Terms are internally represented as directed acyclic graphs, using hash consing for maximal sharing and constant-time syntactic-equality tests. The proof is represented internally as an array of command objects, each corresponding either to an Alethe assume or step command, or to a subproof, which is represented as a step with an (arbitrarily) nested array of command objects. Step identifiers are converted into indices for the arrays, so that access is constant-time.
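The hash-consing idea can be sketched as follows. This is a toy model, not Carcara's actual code: the `Term` and `Pool` types and their methods are hypothetical simplifications.

```rust
use std::collections::HashMap;
use std::rc::Rc;

// Hypothetical, heavily simplified term language: constants and applications.
#[derive(PartialEq, Eq, Hash, Debug)]
enum Term {
    Const(String),
    App(String, Vec<Rc<Term>>),
}

// A hash-consing pool: structurally equal terms are allocated exactly once,
// so pointer equality coincides with syntactic equality.
#[derive(Default)]
struct Pool {
    table: HashMap<Rc<Term>, Rc<Term>>,
}

impl Pool {
    fn intern(&mut self, t: Term) -> Rc<Term> {
        let t = Rc::new(t);
        if let Some(existing) = self.table.get(&t) {
            return Rc::clone(existing);
        }
        self.table.insert(Rc::clone(&t), Rc::clone(&t));
        t
    }
}

fn main() {
    let mut pool = Pool::default();
    let a1 = pool.intern(Term::Const("a".into()));
    let a2 = pool.intern(Term::Const("a".into()));
    // Syntactic equality is now a constant-time pointer comparison.
    assert!(Rc::ptr_eq(&a1, &a2));
    let f1 = pool.intern(Term::App("f".into(), vec![Rc::clone(&a1)]));
    let f2 = pool.intern(Term::App("f".into(), vec![Rc::clone(&a2)]));
    assert!(Rc::ptr_eq(&f1, &f2));
    println!("hash consing ok");
}
```

Because subterms are interned before their parents, building terms bottom-up keeps the maximal-sharing invariant.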

Each command is checked individually by the rule checker corresponding to the rule in that command. That component takes as input the command's conclusion, the conclusions of its premises, and its arguments, as well as the context it is in. As the Alethe format currently has 90 possible rules, Carcara has 90 rule checkers. We highlight below some of the rule checkers, as well as some challenges for checking Alethe proofs and how we addressed them.

Term equality tests. Terms introduced by Alethe rules may have equality subterms implicitly reordered, but the rules are still valid if the conclusion changes only in this way. This flexibility is motivated by solvers often internally representing equalities ignoring order, which may lead to equalities being implicitly reordered when they appear in derived facts. The congruence closure procedure [29] commonly used in SMT is an example of such a component. Since equality symmetry justifies these reorderings, but keeping track of all the changes can be challenging, the format allows them to be implicit.

As a consequence, syntactic equality cannot be the only test for whether two terms are the same. For example, the terms (and p (= a b)) and (and p (= b a)) may be required to be equal. Thus Carcara tests equality in two phases: first the terms are tested for syntactic equality, which can be done in constant time; otherwise they are simultaneously traversed and equality subterms in the same position are compared modulo equality reordering, failing as soon as subterms differ. We refer to this as a polyequal test. As we will see in Section 4.1, these tests can be a substantial portion of overall checking time in some cases.
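The two-phase test can be sketched as below. The `Term` type and `polyeq` function are hypothetical stand-ins, with the fast path here being a structural comparison (in Carcara, hash consing makes it a constant-time pointer check).

```rust
// Hypothetical minimal term type; `Eq(l, r)` models an equality subterm.
#[derive(PartialEq, Debug)]
enum Term {
    Var(&'static str),
    And(Vec<Term>),
    Eq(Box<Term>, Box<Term>),
}

// "Polyequal" test: syntactic equality, except that equality subterms in the
// same position may have their two sides swapped.
fn polyeq(a: &Term, b: &Term) -> bool {
    // Fast path: plain syntactic equality.
    if a == b {
        return true;
    }
    match (a, b) {
        (Term::And(xs), Term::And(ys)) => {
            xs.len() == ys.len() && xs.iter().zip(ys).all(|(x, y)| polyeq(x, y))
        }
        (Term::Eq(l1, r1), Term::Eq(l2, r2)) => {
            (polyeq(l1, l2) && polyeq(r1, r2)) || (polyeq(l1, r2) && polyeq(r1, l2))
        }
        _ => false,
    }
}

fn main() {
    use Term::*;
    let t1 = And(vec![Var("p"), Eq(Box::new(Var("a")), Box::new(Var("b")))]);
    let t2 = And(vec![Var("p"), Eq(Box::new(Var("b")), Box::new(Var("a")))]);
    assert!(polyeq(&t1, &t2)); // equal modulo reordering of (= a b)
    let t3 = And(vec![Var("q"), Eq(Box::new(Var("a")), Box::new(Var("b")))]);
    assert!(!polyeq(&t1, &t3));
    println!("polyeq ok");
}
```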

Checking initial assumptions. The initial assume commands in an Alethe proof must correspond to assertions in the original problem, so their checker searches through the assertions to find a match. In general, this can be done efficiently: assertions are stored in a hash set during parsing, and these assume commands are valid if their conclusions occur in the set. However, assume commands are also impacted by implicit equality reordering, thus requiring polyequal tests. When an assumption does not occur in the assertions hash set, the checker attempts to match it to each assertion in turn, performing a polyequal test. As a consequence, when the original problem is large and the assertions similar and deep, checking assume steps may dominate overall checking time, as our experiments show (Section 4.1).

Checking contextual steps. Steps within subproofs may depend on their context to be valid, so before checking these steps a context object is built based on the anchor opening the subproof. As shown in Section 2, the context elements on which rules may depend are bound variables and substitutions. The former make new symbols available to build terms, while the latter allow steps to be valid modulo applying these substitutions.

Substitutions in Alethe are capture-avoiding, renaming bound variables during application, which facilitates producing proofs with binders [6]. However, this has the side effect of preventing constant-time equality tests, since we must instead check α-equivalence, i.e., a term with bound variables may be required to be equal<sup>6</sup> to the result of applying a substitution that may have renamed some of these variables. To avoid spurious renaming when applying substitutions, the checker only renames bound variables that occur as free variables in the substitution range. Since computing free variables is itself costly, it is done lazily, only when the substitution is to be applied under a binder, and the result is cached.

Note that, as subproofs can be nested, the substitution in context for a step is the composition of a stack of substitutions σ1, . . . , σn. To avoid the sequential application of substitutions, Alethe requires the substitution σ in context to be a cumulative substitution, in which every term t in the range of a substitution σi+1 is replaced by tσi. Thus σ can be applied simultaneously while corresponding to a sequential application of σ1, . . . , σn. As a result of these requirements, handling and applying substitutions can be expensive in Alethe, as shown in Section 4.1.
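A toy sketch of building and applying such a cumulative substitution, using plain strings of single-character variables as a stand-in for terms. The names `Subst`, `apply`, and `extend` are ours, not Carcara's, and the example only illustrates the construction rule quoted above.

```rust
use std::collections::HashMap;

type Subst = HashMap<char, String>;

// Apply a substitution simultaneously to a "term" (a string of variables
// in this toy model).
fn apply(sigma: &Subst, term: &str) -> String {
    term.chars()
        .map(|c| sigma.get(&c).cloned().unwrap_or_else(|| c.to_string()))
        .collect()
}

// Extend the cumulative substitution `cum` with `next`, applying `cum` to
// every term in `next`'s range, as required for cumulative substitutions.
fn extend(cum: &Subst, next: &Subst) -> Subst {
    let mut out = cum.clone();
    for (v, t) in next {
        out.insert(*v, apply(cum, t));
    }
    out
}

fn main() {
    let s1: Subst = [('x', "y".to_string())].into();
    let s2: Subst = [('z', "x".to_string())].into();
    let cum = extend(&s1, &s2); // {x -> y, z -> y}
    // One simultaneous application of the cumulative substitution matches
    // applying s2 and then s1 in sequence.
    let simultaneous = apply(&cum, "z");
    let sequential = apply(&s1, &apply(&s2, "z"));
    assert_eq!(simultaneous, sequential);
    assert_eq!(simultaneous, "y");
    println!("cumulative substitution ok");
}
```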

Finally, the rules enclosing subproofs must be checked as to whether their conclusions are valid given the introduced context and the resulting subproof. For example, the bind rule in Example 1 requires that the bound variable in the quantifier on the right-hand side of the equality matches the range of the substitution put in context for its subproof. The subproof rule, which introduces local assumptions a1, . . . , an and concludes a formula ¬a′1 ∨ · · · ∨ ¬a′n ∨ φ, requires that the enclosed subproof derives φ and that each ai matches a′i.

We now highlight coarse-grained rules whose checking is more intricate and expensive.

Resolution. The rule resolution in Alethe captures hyper-resolution on ground first-order clauses, i.e.,

$$\frac{C_1 \quad \cdots \quad C_n}{C}\ \text{resolution},\ p_1, p_2, \dots, p_{n-1}$$

where C1, . . . , Cn are the premises; pi is the pivot for the binary resolution between Ci and Ci+1, occurring as is in Ci and as ¬pi in Ci+1; and C is the conclusion. While it is simple to check such steps, Alethe allows resolution steps to not provide the pivots, for the sake of facilitating proof production in solvers. Checking such steps requires searching for the pivots and for the binary resolutions in which they are to

<sup>6</sup> Since Alethe has bound-variable renaming rules, the checker requires names to be handled properly, rather than normalizing all binders internally via De Bruijn indices.

be used, but Carcara applies an incomplete heuristic where pivots are inferred from the difference between the literals in the premises and in the conclusion (i.e., literals not in the conclusion must have been pivots that were eventually eliminated). If that fails, we apply a reverse unit propagation (RUP) test [21], i.e., the step is valid if we can derive a conflict via Boolean constraint propagation from the premises and the negated conclusion. Note that Carcara also allows the pivots to be provided as arguments, in which case checking is simple, as expected.
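The RUP test can be sketched as below, over clauses of DIMACS-style integer literals. This naive propagation loop is ours for illustration; an efficient checker would use watched literals.

```rust
// A clause is a vector of literals; a literal is a nonzero integer, with
// negation as arithmetic negation (DIMACS-style).
type Clause = Vec<i32>;

// Reverse unit propagation: the step is valid if the premises together with
// the negation of every conclusion literal yield a conflict by Boolean
// constraint propagation alone.
fn rup(premises: &[Clause], conclusion: &Clause) -> bool {
    let mut clauses: Vec<Clause> = premises.to_vec();
    for &lit in conclusion {
        clauses.push(vec![-lit]); // negate the conclusion
    }
    let mut assignment: Vec<i32> = Vec::new();
    loop {
        let mut changed = false;
        for clause in &clauses {
            // Drop literals falsified by the current assignment.
            let remaining: Vec<i32> = clause
                .iter()
                .copied()
                .filter(|l| !assignment.contains(&-l))
                .collect();
            if remaining.iter().any(|l| assignment.contains(l)) {
                continue; // clause already satisfied
            }
            match remaining.as_slice() {
                [] => return true, // conflict found
                [unit] => {
                    assignment.push(*unit);
                    changed = true;
                }
                _ => {}
            }
        }
        if !changed {
            return false;
        }
    }
}

fn main() {
    // Resolving (p ∨ q) and (¬p ∨ q) yields (q)...
    let premises = vec![vec![1, 2], vec![-1, 2]];
    assert!(rup(&premises, &vec![2]));
    // ...but not (p).
    assert!(!rup(&premises, &vec![1]));
    println!("rup ok");
}
```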

AC simplification. Normalization modulo associativity and commutativity for conjunction and disjunction can be represented in Alethe via the ac_simp rule, which establishes the equality between a term t and a term t′ that is t with nested occurrences of these connectives flattened and duplicate arguments removed, until a fixpoint. While this simplification is performance-critical [6, Sec. 4.6], checking the corresponding rule requires traversing t and performing the normalization, which is proportional to t's depth.
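A minimal sketch of this normalization, on a toy formula type with only conjunction (the `Formula` and `ac_simp` names are ours). A single bottom-up pass suffices for this fragment, though the rule in general iterates to a fixpoint.

```rust
// Toy formula type; `ac_simp` flattens nested conjunctions and removes
// duplicate arguments.
#[derive(Clone, PartialEq, Debug)]
enum Formula {
    Atom(&'static str),
    And(Vec<Formula>),
}

fn ac_simp(f: &Formula) -> Formula {
    match f {
        Formula::Atom(_) => f.clone(),
        Formula::And(args) => {
            let mut flat: Vec<Formula> = Vec::new();
            for arg in args {
                match ac_simp(arg) {
                    // Flatten nested conjunctions.
                    Formula::And(inner) => flat.extend(inner),
                    other => flat.push(other),
                }
            }
            // Remove duplicates, keeping first occurrences.
            let mut unique: Vec<Formula> = Vec::new();
            for g in flat {
                if !unique.contains(&g) {
                    unique.push(g);
                }
            }
            if unique.len() == 1 {
                unique.pop().unwrap()
            } else {
                Formula::And(unique)
            }
        }
    }
}

fn main() {
    use Formula::*;
    // (and p (and q p)) normalizes to (and p q).
    let t = And(vec![Atom("p"), And(vec![Atom("q"), Atom("p")])]);
    assert_eq!(ac_simp(&t), And(vec![Atom("p"), Atom("q")]));
    println!("ac_simp ok");
}
```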

Arithmetic reasoning. Apart from simplification rules, arithmetic reasoning in Alethe is mainly captured by two rules: la_generic and lia_generic. Both rules conclude a clause of negated linear inequalities, which is valid due to Farkas' lemma [18], which guarantees that there exists a linear combination of these inequalities equivalent to ⊥. The la_generic rule takes as arguments the coefficients of this linear combination, with which the rule can be checked by applying simple (but costly) operations on the coefficients to reduce the linear combination to ⊥ (see [2, Sec. 5.4, Rule 9] for the algorithm). The checker uses GMP [1] to efficiently perform the required computations with the coefficients.
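The core of such a check can be sketched as below, with integer-only coefficients for brevity (the actual algorithm [2, Sec. 5.4, Rule 9] works with arbitrary-precision rationals via GMP; the `Ineq` and `check_farkas` names are ours).

```rust
use std::collections::HashMap;

// An inequality  sum_i coeff_i * x_i <= constant  (strict if `strict`).
struct Ineq {
    coeffs: HashMap<&'static str, i64>,
    constant: i64,
    strict: bool,
}

// Check a Farkas refutation: the nonnegative linear combination of the
// inequalities with the given multipliers must cancel every variable and
// leave a contradictory constant bound.
fn check_farkas(ineqs: &[Ineq], multipliers: &[i64]) -> bool {
    let mut sum_coeffs: HashMap<&'static str, i64> = HashMap::new();
    let mut sum_const = 0i64;
    let mut any_strict = false;
    for (ineq, &m) in ineqs.iter().zip(multipliers) {
        if m < 0 {
            return false; // multipliers must be nonnegative
        }
        for (&var, &c) in &ineq.coeffs {
            *sum_coeffs.entry(var).or_insert(0) += m * c;
        }
        sum_const += m * ineq.constant;
        any_strict |= ineq.strict && m > 0;
    }
    // All variables must cancel; then "0 <= sum_const" (or "0 < sum_const"
    // if some strict inequality contributed) must be false.
    sum_coeffs.values().all(|&c| c == 0)
        && (sum_const < 0 || (any_strict && sum_const == 0))
}

fn main() {
    // x <= 0 and -x <= -1 are jointly unsatisfiable: adding them with
    // multipliers 1, 1 gives 0 <= -1.
    let i1 = Ineq { coeffs: [("x", 1)].into(), constant: 0, strict: false };
    let i2 = Ineq { coeffs: [("x", -1)].into(), constant: -1, strict: false };
    assert!(check_farkas(&[i1, i2], &[1, 1]));
    println!("farkas ok");
}
```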

While la_generic can be checked effectively, lia_generic cannot. It provides only the negated inequalities, so checking it would require searching for the coefficients, essentially repeating the arithmetic solving in the checker. As a consequence this rule is considered a hole, and Carcara ignores it during proof checking, issuing a warning.

#### 3.2 Elaborating Alethe proofs

In order to mitigate bottlenecks in checking some Alethe steps, Carcara can also elaborate Alethe proofs into easier-to-check ones by filling in missing details from the original proofs. This is done by replacing coarse-grained steps with fine-grained proofs of their conclusions, producing a new overall proof equivalent to the original, but with some coarse-grained steps broken down into fine-grained ones. Formally, a proof such as the one below on the left, with a coarse step concluding ψ from premises ψ1, . . . , ψn, is elaborated into the proof on the right, where the coarse step is replaced by a proof π, with fine-grained steps, rooted in ψ and with ψ1, . . . , ψn as leaves:

$$\frac{\dfrac{\psi_1 \ \cdots \ \psi_n}{\psi}\ \text{coarseStep} \quad \cdots}{\Theta}\ \text{rule} \quad \Rightarrow_{\text{elab}} \quad \frac{\begin{array}{c} \psi_1 \ \cdots \ \psi_n \\ \vdots\, \pi \\ \psi \end{array} \quad \cdots}{\Theta}\ \text{rule}$$

```
(step t2.t1 (cl (not (= a b)) (not (= b c)) (not (= c d)) (= a d))
  :rule eq_transitive)
(step t2.t2 (cl (not (= b a)) (= a b)) :rule eq_symmetric)
(step t2.t3 (cl (not (= c b)) (= b c)) :rule eq_symmetric)
(step t2.t4 (cl (not (= c d)) (= a d) (not (= b a)) (not (= c b)))
  :rule resolution :premises (t2.t1 t2.t2 t2.t3))
(step t2 (cl (not (= b a)) (not (= c d)) (not (= c b)) (= a d))
  :rule reordering :premises (t2.t4))
```
Fig. 3: Elaboration of an eq_transitive step. Note that the new eq_transitive step is easy to check, and the final t2 step has the same conclusion as the original.

Note that the expansion only affects the proof locally, since any step using the conclusion of the coarse step as a premise may use the conclusion of π interchangeably.

There are many Alethe rules whose checking would be simpler if elaborated, but we have focused initially on what we believe to be most impactful: removing implicit equality reordering, and thus polyequal tests, which affects virtually every Alethe rule; and providing checkable justifications for lia_generic steps, to remove holes from proofs. Before detailing these methods, we illustrate the elaboration process with an example.

Elaborating transitivity steps. The eq_transitive rule concludes a valid clause composed of negated equalities followed by a single positive equality, such that the negated equalities form a transitive chain resulting in the final equality. However, the specification does not impose an order on the negated equalities (which can, remember, also be implicitly reordered). So the following step, with a "shuffled" chain, must also be valid:

```
(step t2 (cl (not (= b a)) (not (= c d)) (not (= c b)) (= a d))
  :rule eq_transitive)
```
This permissive specification again facilitates proof production (particularly from congruence closure procedures), but requires the eq_transitive checker, for every link in the chain, to potentially traverse the whole clause searching for the next link, performing polyequal tests throughout. The goal of elaborating eq_transitive steps is to justify steps like t2 in a fine-grained manner. If we changed the conclusion of the step, this would impact the rest of the proof wherever t2 is used as a premise. We instead introduce a fine-grained proof for t2's conclusion, as shown in Figure 3: an easy-to-check eq_transitive step (t2.t1), eq_symmetric steps to flip the equalities (t2.t2, t2.t3), and resolution (t2.t4) and reordering (t2) steps to derive the original conclusion.
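The chain search such a checker must perform can be sketched as follows, over equalities between single-character constants. This greedy traversal is our simplification (the name `check_eq_transitive` is ours, and a complete checker may need backtracking where the greedy choice fails); it also ignores polyequal tests, treating links as unordered pairs.

```rust
// Check an eq_transitive conclusion: the negated equalities (as unordered
// pairs) must chain from the goal equality's left side to its right side,
// in any order and with either orientation.
fn check_eq_transitive(negated: &[(char, char)], goal: (char, char)) -> bool {
    let mut used = vec![false; negated.len()];
    let mut current = goal.0;
    // Greedily follow links until we reach the goal's right-hand side.
    loop {
        if current == goal.1 {
            return used.iter().all(|&u| u); // every premise link consumed
        }
        let next = negated
            .iter()
            .enumerate()
            .find(|&(i, &(a, b))| !used[i] && (a == current || b == current));
        match next {
            Some((i, &(a, b))) => {
                used[i] = true;
                current = if a == current { b } else { a };
            }
            None => return false,
        }
    }
}

fn main() {
    // The "shuffled" chain from step t2: links a-b, b-c, c-d prove a = d.
    let links = [('b', 'a'), ('c', 'd'), ('c', 'b')];
    assert!(check_eq_transitive(&links, ('a', 'd')));
    assert!(!check_eq_transitive(&links, ('a', 'c')));
    println!("eq_transitive ok");
}
```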

Elaborating implicit equality reordering. Similarly to the above, steps concluding a term t with some equality subterm implicitly reordered have their conclusion replaced by t′, where that subterm is not reordered, and a fine-grained proof of the conversion of t′ into t is added. Figure 4 illustrates this process for an assume

```
(set-logic QF_UF)
(declare-const a Bool)
(declare-const b Bool)
(declare-const p Bool)
(assert (not (or p (= a b))))
(assert (or p (= b a)))
(check-sat)
```
Fig. 4a: An example SMT problem instance.

```
(assume h1 (not (or p (= a b))))
(assume h2 (or p (= a b)))
(step t3 (cl) :rule resolution
              :premises (h1 h2))
```
Fig. 4b: An Alethe proof for the SMT problem in Figure 4a. Notice that this proof makes use of implicit reordering of equalities in h2.

```
(assume h1 (not (or p (= a b))))
(assume h2 (or p (= b a)))
(step h2.t1 (cl (= (= b a) (= a b))) :rule equiv_simplify)
(step h2.t2 (cl (= (or p (= b a)) (or p (= a b))))
    :rule cong :premises (h2.t1))
(step h2.t3 (cl (not (or p (= b a))) (or p (= a b)))
    :rule equiv1 :premises (h2.t2))
(step h2.t4 (cl (or p (= a b))) :rule resolution :premises (h2 h2.t3))
(step t3 (cl) :rule resolution :premises (h1 h2.t4))
```
Fig. 4c: The elaborated proof without implicit equality reordering.

command, where step h2.t1 is the rewriting justifying the equality reordering of the subterm and the following steps rebuild the original conclusion.

In the original proof, the assume command h2 introduces the term (or p (= a b)), which is the original assertion (or p (= b a)) with the equality (= b a) implicitly reordered. In the elaborated proof (Figure 4c), the conclusion of h2 is replaced by one without implicit equality reordering, but step t3 expects the original conclusion. The steps h2.t1 to h2.t4 convert the new h2 conclusion into the original one, relying on standard equality reasoning and on resolution to connect the introduced steps. Notice that the t3 step, which originally referred to h2 as a premise, now refers to h2.t4.

When applied to every step concluding terms with implicit equality reordering, the result of this elaboration method is a proof where equality tests are only syntactic, erasing the overhead of checking assumptions and performing polyequal tests.

Elaborating lia_generic steps. As discussed in Section 3.1, Carcara considers lia_generic steps holes in the proof, as checking them is as hard as solving. Since our goal is to keep Carcara as simple as possible, we rely on an external tool to elaborate the step by solving a problem corresponding to it in a proof-producing manner, then import the proof, checking it and guaranteeing that it is sound to replace the original step. Any tool producing detailed Alethe proofs for linear-integer arithmetic reasoning can be used to this end, but currently only cvc5 can do so [7]. We note that cvc5 currently has the limitation that its Alethe proofs may contain rewrite steps not yet modeled in the Alethe simplification rules [2, Sec. 5.11], which are thus not supported by Carcara. They are considered holes, but since these are generally simple simplification rules, they are much less harmful than lia_generic ones.

In detail, when the elaboration method encounters a lia_generic step S concluding the negated inequalities ¬l1 ∨ · · · ∨ ¬ln, it generates an SMT-LIB problem asserting l1 ∧ · · · ∧ ln and invokes cvc5 on it, expecting an Alethe proof π : (l1 ∧ · · · ∧ ln) → ⊥. Carcara checks each step in π and, if none is invalid, replaces step S in the original proof by a proof of the form:

```
(anchor :step S.t_m+1)
(assume S.h_1 l1)
...
(assume S.h_n ln)
...
(step S.t_m (cl false) :rule ...)
(step S.t_m+1 (cl (not l1) ... (not ln) false) :rule subproof)
(step S.t_m+2 (cl (not false)) :rule false)
(step S (cl (not l1) ... (not ln))
  :rule resolution :premises (S.t_m+1 S.t_m+2))
```
where steps S.h_1 through S.t_m are imported from the cvc5 proof. As a result, the lia_generic step S in the original proof will have been replaced by a detailed justification whose correctness can be independently established by Carcara.

## 4 Evaluation

We evaluate Carcara's proof-checking performance and the impact of its elaboration methods. We use the veriT solver [13], version 2021.06-40-rmx, to generate Alethe proofs from all problems in the SMT-LIB benchmark library<sup>7</sup> whose logic it supports, with a 120-second timeout. We did not consider cvc5, as its support for Alethe is not yet as mature or complete. The veriT solver produced 39,229 proofs. They total 92 GB but vary greatly in size: the largest proof has 4.5 GB, fourteen have at least 1 GB, and over a hundred have more than 100 MB, while almost 90% are under 1 MB. All the experiments were run on a server equipped with AWS Graviton2 2.5 GHz ARM CPUs, with 4 GB of memory for each job.

## 4.1 Proof checking

We ran Carcara on each proof until checking succeeded or failed. Only 378 had checking failures, which were due to incorrect<sup>8</sup> steps for quantifier simplifications (Skolemization and elimination of one-point quantifiers) and AC normalization. The issues have been communicated to the solver developers. For the successful proofs, Table 1 gives a summary, for each SMT-LIB logic, of the cumulative solving time by veriT and checking time by Carcara.

<sup>7</sup> https://smtlib.cs.uiowa.edu/benchmarks.shtml

<sup>8</sup> In a superficial analysis the steps seemed sound, but the proofs were incorrect.


Table 1: Total solving and proof-checking time per logic for veriT and Carcara.

As expected, the comparison is heavily logic-dependent. In quantified logics (top of the table), checking is generally significantly cheaper than solving. An outlier is AUFLIRA, which is explained by the problems for which veriT could produce proofs all being both simple to solve and simple to check. In logics such as QF_UF and QF_IDL, which can have very large proofs, overall checking time is comparable to solving time, if still noticeably smaller in total.

When comparing per problem, for the large majority of proofs (81.61%) the checking time was smaller than the solving time. Furthermore, for 3.96% of the proofs checking was more than 10 times faster than solving, and for 0.96% it was more than 100 times faster. There were only 24 instances where the checking time was more than 10 times greater than the solving time, and in all of them the checking time was less than 0.6 seconds.

We also evaluate the per-rule frequency, as shown in Figure 5b, and checking time, with Figure 6a showing the cumulative checking times and Figure 5a a box plot over individual rule checks. The lower whisker represents the 5th percentile, the lower bound of the box the first quartile, the line inside the box the median, the upper bound of the box the third quartile, and the upper whisker the 95th percentile<sup>9</sup>. Rules that are rare and have negligible checking time are omitted. The data is gathered from proof checking on all proofs, even those that failed.

The assume commands account for a large proportion of the total time. This is justified by their checking being, due to implicit equality reordering, potentially proportional to both the number and the depth of the assertions in the original problem. The box plot shows that the worst cases lead to the most expensive rule checks among all rules.

<sup>9</sup> The plots follow the same criteria of the evaluation in [36].

Fig. 5a: Box plot for checking time per rule. Fig. 5b: Frequency per rule (only most frequent shown).

The rules with the highest overall time are resolution, ac_simp, and la_generic. For resolution this is explained mainly by its high frequency (as is similarly the case for cong), as well as by some more expensive checks (veriT does not provide pivots), as shown in the box plot. As for ac_simp and la_generic, while they are much less frequent, their checking is expensive (Section 3.1).

Other expensive rules to note are those related to contexts involving substitutions<sup>10</sup>, especially let, for let elimination, and refl. It is common for let subproofs to be deeply nested, leading to large cumulative substitutions needing to be computed. As for refl, besides being one of the most frequent rules, about a third of its total time is spent on polyequal tests, and most of the rest relates to handling and applying substitutions, as well as checking α-equivalence.

#### 4.2 Proof elaboration

We ran Carcara, on each successfully checked proof, in proof-elaboration mode with the elaboration of transitivity steps and, more importantly, the removal of implicit equality reordering. On average, excluding parsing, elaboration takes 40% of the time required for checking. We focus on how the result of elaboration impacts proof checking.

Figure 7 compares, per proof, the proof-checking time on the original proof and on the elaborated one (excluding parsing time). There is no clear winner, but note that for harder proofs (those originally requiring at least 1 s), checking the elaborated proof is often significantly faster. A per-rule analysis is shown in Figure 6b, with the proportion of the checking time spent

<sup>10</sup> The ones shown in the plots are let, bind, sko_forall, and onepoint.

Fig. 6a: Total checking time per rule. Fig. 6b: Times after elaboration.

in each rule, for the elaborated proofs. Compared to Figure 6a, the checking time for assume steps becomes negligible in the elaborated proofs, as checking them now amounts to checking occurrence in a hash set. The overall time for refl also decreases, but only by 10%. This can be explained by the refl steps added during elaboration: while checking each refl is now potentially cheaper, this is offset by their increased number. Note that these additions also impact other rules, especially cong, whose cumulative time increased by 13%. Overall, proof elaboration resulted in a net improvement in checking time of 6%. Parsing time, however, increased, which made the overall runtime for proof-checking the original proofs virtually the same as for the elaborated proofs.

The results indicate that elaborating implicit equality reordering is not always worth it, especially for high-performance tools. However, it successfully yields proofs not requiring polyequal tests, which may help performance in other scenarios. For example, the reconstruction of Alethe proofs in Isabelle/HOL requires equality tests to be done by applying a normalizer to both terms and then testing them for syntactic equality. This leads to performance issues when reconstructing some rules [36], which this elaboration method would avoid.

Fig. 7: Before vs after elaboration.

Elaborating lia_generic steps. In our benchmark set, 276 proofs contain a total of 127k lia_generic steps. As a proof of concept, we instrumented Carcara to apply the elaboration method described in Section 3.2 via a connection with cvc5<sup>11</sup>. Due to the still-experimental Alethe proof production in cvc5, we only considered SMT problems derived

<sup>11</sup> cvc5-1.0.2, modified for better Alethe support, provided by the cvc5 team.

from lia_generic steps in proofs for the QF_UFLIA and QF_LIA logics. This excluded only 15 proofs, each containing exactly one lia_generic step. We ran Carcara in proof-elaboration mode with a 30-minute timeout for each proof. For each lia_generic step, cvc5 was invoked with a 30 s timeout, and the resulting Alethe proof, if any, replaced the original lia_generic step, as described in Section 3.2.

Of the 261 proofs, Carcara timed out on only 13. Of the remaining 248 proofs, 82 still contained lia_generic steps after elaboration, either because cvc5 timed out when solving the generated problem or because the cvc5 proofs contained lia_generic steps of their own. Note, however, that these are still improvements over the original lia_generic steps, since generally fewer inequalities are involved and the steps are potentially simpler to solve, were the process to be repeated. Similarly, although all elaborated proofs contained holes from cvc5 rewriting steps, these are much simpler than the original lia_generic ones.

As with the elaboration of implicit equality reordering, this elaboration method would be particularly impactful in scenarios such as Alethe reconstruction in Isabelle/HOL, where steps such as lia_generic are reconstructed via limited internal automation for arithmetic reasoning, which is known to fail [36, Sec. 4.3].

## 5 Conclusion and future work

Our evaluation shows that Carcara has good performance and can identify shortcomings in the proof production of established SMT solvers. Carcara can also elaborate proofs into demonstrably easier-to-check ones, which can have a significant impact, for example, if it is used as a bridge between solvers and proof assistants. Extending Carcara to convert Alethe proofs into other formats would also allow the elaboration techniques to benefit other toolchains.

As future work, we will add support for parallel proof checking, since steps in the same context can be checked completely independently. We will also add new elaboration methods for resolution and ac_simp, which occasionally are bottlenecks, and will provide elaboration for rewrite rules, which can change significantly between different solvers, complicating proof production if solvers have to phrase their rewrites with a fixed set of rules. An automatic conversion into a defined set of rewrite rules, as described in [32], would address this issue.

Finally, we expect Carcara to facilitate improving how we use Alethe proofs. For example, our large-scale evaluation shows the significant time spent on contextual substitutions, which is mainly due to the Alethe requirement of only applying substitutions simultaneously. Extending the proof format to allow other substitution application strategies may be beneficial in different scenarios, as proof production in some solvers has indicated [7, Sec. 5.1]. In general, extensions to the format (for example, to other logical theories) can be done in a more informed way with the help of an independent checker.

Acknowledgments. We thank the reviewers for their helpful suggestions to improve this paper as well as Carcara. We thank Hans-Jörg Schurr for his extensive work in detailing the semantics of Alethe, which greatly facilitated developing Carcara.

Data Availability Statement. The datasets generated and analyzed during the current study are available in the Zenodo repository: https://zenodo.org/record/7574451 [3].

## References


and Zhong Shao, editors, Certified Programs and Proofs - First International Conference, CPP 2011, Kenting, Taiwan, December 7-9, 2011. Proceedings, volume 7086 of Lecture Notes in Computer Science, pages 183–198. Springer, 2011.


Structures for Computation and Deduction (FSCD), volume 167 of LIPIcs, pages 35:1–35:16. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020.


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Constraint Solving/Blockchain**

## The Packing Chromatic Number of the Infinite Square Grid is 15

Bernardo Subercaseaux and Marijn J. H. Heule

Carnegie Mellon University, Pittsburgh, PA 15213, USA {bsuberca,mheule}@cs.cmu.edu

Abstract. A packing k-coloring is a natural variation on the standard notion of graph k-coloring, where vertices are assigned numbers from {1,...,k}, and any two vertices assigned a common color c ∈ {1,...,k} need to be at a distance greater than c (as opposed to 1, in standard graph colorings). Despite a sequence of incremental work, determining the packing chromatic number of the infinite square grid has remained an open problem since its introduction in 2002. We culminate the search by proving this number to be 15. We achieve this result by improving the best-known method for this problem by roughly two orders of magnitude. The most important technique to boost performance is a novel, surprisingly effective propositional encoding for packing colorings. Additionally, we developed an alternative symmetry breaking method. Since both new techniques are more complex than existing techniques for this problem, a verified approach is required to trust them. We include both techniques in a proof of unsatisfiability, reducing the trusted core to the correctness of the direct encoding.

Keywords: Packing coloring · SAT · Verification.

## 1 Introduction

Automated reasoning techniques have been successfully applied to a variety of coloring problems, ranging from the classical computer-assisted proof of the Four Color Theorem [1] to progress on the Hadwiger-Nelson problem [21] and improved bounds on Ramsey-like numbers [19]. This article contributes a new success story to the area: we show the packing chromatic number of the infinite square grid to be 15, thus solving via automated reasoning techniques a combinatorial problem that had remained elusive for over 20 years.

The notion of packing coloring was introduced in the seminal work of Goddard et al. [10], and since then more than 70 articles have studied it [3], establishing it as an active area of research. Let us consider the following definition.

Definition 1. A packing k-coloring of a simple undirected graph G = (V,E) is a function f from V to {1,...,k} such that for any two distinct vertices u, v ∈ V , and any color c ∈ {1,...,k}, it holds that f(u) = f(v) = c implies d(u, v) > c.


Both authors are supported by the U.S. National Science Foundation under grant CCF-2015445.

S. Sankaranarayanan and N. Sharygina (Eds.): TACAS 2023, LNCS 13993, pp. 389–406, 2023. https://doi.org/10.1007/978-3-031-30823-9_20

Note that by changing the last condition to d(u, v) > 1 we recover the standard notion of coloring, thus making packing colorings a natural variation of them. Intuitively, in a packing coloring, larger colors forbid being reused in a larger region of the graph around them. Indeed, packing colorings were originally presented under the name of broadcast coloring, motivated by the problem of assigning broadcast frequencies to radio stations in a non-conflicting way [10], where two radio stations that are assigned the same frequency need to be at distance greater than some function of the power of their broadcast signals. Therefore, a large color represents a powerful broadcast signal at a given frequency, that cannot be reused anywhere else within a large radius around it, to avoid interference. Minimizing the number of colors assigned can thus be interpreted as minimizing the pollution of the radio spectrum. The literature has preferred the name packing coloring ever since [3].
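Definition 1 can be checked mechanically on finite graphs. The following sketch (our own helper, not part of the paper's tooling) verifies a candidate packing coloring by bounded breadth-first search:

```python
from collections import deque

def is_packing_coloring(adj, coloring):
    """Check Definition 1: two vertices sharing color c must be at
    graph distance greater than c. adj maps each vertex to its
    neighbours; coloring maps each vertex to a color in {1,...,k}."""
    def ball(source, radius):
        # Vertices at distance <= radius from source (bounded BFS).
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            if dist[u] < radius:
                for v in adj[u]:
                    if v not in dist:
                        dist[v] = dist[u] + 1
                        queue.append(v)
        return dist

    for u, c in coloring.items():
        for v in ball(u, c):
            if v != u and coloring.get(v) == c:
                return False  # v reuses color c at distance <= c
    return True

# The path 0-1-2-3 admits the packing 3-coloring 1, 2, 1, 3:
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
assert is_packing_coloring(path, {0: 1, 1: 2, 2: 1, 3: 3})
assert not is_packing_coloring(path, {0: 2, 1: 1, 2: 2, 3: 3})
```

The second assertion fails the check because the two vertices colored 2 are at distance 2, not greater than 2.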

Analogously to the case of standard colorings, we can naturally define the notion of packing chromatic number, and study its computation.

Definition 2. Given a graph G = (V,E), define its packing chromatic number χρ(G) as the minimum value k such that G admits a packing k-coloring.

Example 1. Consider the infinite graph with vertex set Z and with edges between consecutive integers, which we denote as Z<sup>1</sup>. A packing 3-coloring is illustrated in Figure 1. On the other hand, by examination one can observe that it is impossible to obtain a packing 2-coloring for Z<sup>1</sup>.

Fig. 1: Illustration of a packing 3-coloring for Z<sup>1</sup>.
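One standard packing 3-coloring of Z<sup>1</sup> repeats the period-4 pattern 1, 2, 1, 3 (the coloring shown in Figure 1 may differ by a shift or reflection); a quick check on a finite window confirms the packing condition:

```python
# Period-4 pattern 1, 2, 1, 3: one valid packing 3-coloring of Z^1.
def color(i):
    return [1, 2, 1, 3][i % 4]

# On Z^1 the graph distance between i and j is |i - j|, so the packing
# condition f(i) = f(j) = c implies |i - j| > c is easy to check.
violations = [(i, j)
              for i in range(-40, 40)
              for j in range(i + 1, 40)
              if color(i) == color(j) and j - i <= color(i)]
assert violations == []
```

Color 1 recurs with gap 2 > 1, while colors 2 and 3 recur with gap 4, greater than both 2 and 3.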

While Example 1 shows that χρ(Z<sup>1</sup>) = 3, the question of computing χρ(Z<sup>2</sup>), where Z<sup>2</sup> is the graph with vertex set Z × Z and edges between orthogonally adjacent points (i.e., points whose ℓ<sub>1</sub> distance equals 1), has been open since the introduction of packing colorings by Goddard et al. [10]. On the other hand, it is known that χρ(Z<sup>3</sup>) = ∞ (again considering edges between points whose ℓ<sub>1</sub> distance equals 1) [9]. The problem of computing χρ(Z<sup>2</sup>) has received significant attention, and it is described as "the most attractive [of the packing coloring problems over infinite graphs]" by Brešar et al. [3]. We can now state our main theorem, providing a final answer to this problem.

Theorem 1. χρ(Z<sup>2</sup>) = 15.

An upper bound of 15 had already been proved by Martin et al. [18], who found a packing 15-coloring of a 72 × 72 grid that can be used for periodically tiling the entirety of Z<sup>2</sup>. Therefore, the main contribution of our work consists of proving that 14 colors are not enough for Z<sup>2</sup>. Table 1 presents a summary of the historical progress on computing χρ(Z<sup>2</sup>). It is worth noting that amongst the computer-generated proofs (i.e., all since Soukal and Holub [22] in 2010), ours is the first one to be formally verified, see Section 4.


Table 1: Historical summary of the bounds known for χρ(Z2).

For any fixed k ≥ 4, the problem of determining whether a graph G admits a packing k-coloring is known to be NP-hard [10], and thus we do not expect a polynomial-time algorithm for computing χρ(·). This naturally motivates the use of satisfiability (SAT) solvers for studying the packing chromatic number of finite subgraphs of Z<sup>2</sup>. The rest of this article is thus devoted to proving Theorem 1 by using automated reasoning techniques, in a way that produces a proof that can be checked independently and that has been checked by verified software.

## 2 Background

We start by recapitulating the components used to obtain a lower bound of 14 in our previous work [23]. Naturally, in order to prove a lower bound for Z<sup>2</sup> one needs to prove a lower bound for a finite subgraph of it. As in earlier work, we consider disks (i.e., 2-dimensional balls in the ℓ<sub>1</sub>-metric) as the finite subgraphs to study [23]. Concretely, let Dr(v) be the subgraph induced by {u ∈ V(Z<sup>2</sup>) | d(u, v) ≤ r}. To simplify notation, we use Dr as a shorthand for Dr((0, 0)), and we let Dr,k be the instance consisting of deciding whether Dr admits a packing k-coloring. Moreover, let Dr,k,c be the instance Dr,k but enforcing that the central vertex (0, 0) receives color c (Fig. 2).

For example, a simple lemma of Subercaseaux and Heule [23, Proposition 5] proves that the unsatisfiability of D3,6,3 is enough to deduce that χρ(Z<sup>2</sup>) ≥ 7. We will prove a slight variation of it (Lemma 2) later on in order to prove Theorem 1, but for now let us summarize how they proved that D12,13,12 is unsatisfiable.

Encodings. The direct encoding for Dr,k,c consists simply of variables xv,t stating that vertex v gets color t, as well as the following clauses:

1. (at-least-one-color clauses) every vertex receives at least one color among {1,...,k};
2. (at-most-one-distance clauses)

$$
\overline{x\_{u,t}} \vee \overline{x\_{v,t}}, \quad \forall t \in \{1, \ldots, k\}, \forall u, v \in V \text{ s.t. } 0 < d(u, v) \le t;
$$

3. (center clause) x(0,0),c.

Fig. 2: Illustration of satisfying assignments for D3,7,3 and D3,6,6. On the other hand, D3,6,3 is not satisfiable.
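The direct encoding can be sketched as a clause generator over DIMACS-style signed-integer literals (the variable numbering and function name below are our own, not the authors' exact implementation):

```python
from itertools import product

def direct_encoding(r, k, c):
    """Generate the direct encoding of D_{r,k,c}: a sketch for
    illustration, returning a variable map and a clause list."""
    vertices = [(x, y) for x, y in product(range(-r, r + 1), repeat=2)
                if abs(x) + abs(y) <= r]
    var = {(v, t): i + 1
           for i, (v, t) in enumerate(product(vertices, range(1, k + 1)))}
    clauses = []
    # at-least-one-color: every vertex receives some color.
    for v in vertices:
        clauses.append([var[v, t] for t in range(1, k + 1)])
    # at-most-one-distance: color t cannot repeat within distance t.
    for t in range(1, k + 1):
        for i, u in enumerate(vertices):
            for v in vertices[i + 1:]:
                if abs(u[0] - v[0]) + abs(u[1] - v[1]) <= t:
                    clauses.append([-var[u, t], -var[v, t]])
    # center clause: vertex (0, 0) receives color c.
    clauses.append([var[(0, 0), c]])
    return var, clauses

# |D_r| = 2r^2 + 2r + 1 vertices, each with k variables:
var, clauses = direct_encoding(2, 3, 3)
assert len(var) == 13 * 3
```

Summing the at-most-one-distance clauses over all colors reproduces the O(r²k³) size mentioned below.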

This amounts to O(r<sup>2</sup>k<sup>3</sup>) clauses [23]. The recursive encoding is significantly more involved, but it leads to only O(r<sup>2</sup>k log k) clauses asymptotically. Unfortunately, the constant involved in the asymptotic expression is large, and this encoding did not give them practical speed-ups [23].

Cube And Conquer. Introduced by Heule et al. [13], the Cube And Conquer approach aims to split a SAT instance ϕ into multiple SAT instances ϕ1, ..., ϕm in such a way that ϕ is satisfiable if, and only if, at least one of the instances ϕi is satisfiable, thus allowing one to work on the different instances ϕi in parallel. If ψ = (c1 ∨ c2 ∨ ··· ∨ cm) is a tautological DNF, then we have

$$\text{SAT}(\varphi) \iff \text{SAT}(\varphi \land \psi) \iff \text{SAT}\left(\bigvee\_{i=1}^{m} (\varphi \land c\_i)\right) \iff \text{SAT}\left(\bigvee\_{i=1}^{m} \varphi\_i\right),$$

where the different ϕi := (ϕ ∧ ci) are the instances resulting from the split.

Intuitively, each cube ci represents a case, i.e., an assumption about a satisfying assignment to ϕ, and soundness comes from ψ being a tautology, which means that the split into cases is exhaustive. If the split is well designed, then each ϕi is a particular case that is substantially easier to solve than ϕ, and thus solving them all in parallel can give significant speed-ups, especially considering the sequential nature of CDCL, at the core of most solvers. Our previous work [23] proposed a concrete algorithm to generate a split, which already results in an almost linear speed-up, meaning that by using 128 cores, the performance gain is roughly a ×60 factor.
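The soundness of splitting on a tautological DNF can be exercised on a toy formula with a brute-force satisfiability check (the formula and cubes below are made up for illustration):

```python
from itertools import product

def brute_sat(clauses, n_vars):
    """Naive SAT test: clauses are lists of signed ints over vars 1..n_vars."""
    for bits in product([False, True], repeat=n_vars):
        if all(any(bits[abs(l) - 1] ^ (l < 0) for l in cl) for cl in clauses):
            return True
    return False

phi = [[1, 2], [-1, 3], [-2, -3]]   # toy CNF instance
cubes = [[1], [-1, 2], [-1, -2]]    # tautological DNF psi
# phi_i = phi with the literals of cube c_i added as unit clauses:
phis = [phi + [[lit] for lit in cube] for cube in cubes]

# SAT(phi) holds iff SAT(phi_i) holds for at least one i.
assert brute_sat(phi, 3) == any(brute_sat(f, 3) for f in phis)
```

Because the three cubes cover every assignment of x1 and x2, no satisfying assignment of ϕ can escape all of the cases.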

Symmetry Breaking. The idea of symmetry breaking [6] consists of exploiting the symmetries that are present in SAT instances to speed up computation. In particular, Dr,k,c instances have 3 axes of symmetry (i.e., vertical, horizontal, and diagonal), which allowed for close to an 8-fold improvement in performance for proving D12,13,12 to be unsatisfiable. The particular use of symmetry breaking in our previous approach [23] was happening at the Cube And Conquer level, where out of the sub-instances ϕ1, ..., ϕm produced by the split, only a <sup>1</sup>/8-fraction of them had to be solved, as the rest were equivalent under isomorphism.

Verification. Arguably the biggest drawback of our previous approach for proving a lower bound of 14 is that it lacked the capability of generating a computer-checkable proof. To claim a full solution to the 20-year-old problem of computing χρ(Z<sup>2</sup>) that is accepted by the mathematics community, we deem paramount a fully verifiable proof that can be scrutinized independently.

The most commonly-used proofs for SAT problems are expressed in the DRAT clausal proof system [11]. A DRAT proof of unsatisfiability is a list of clause addition and clause deletion steps. Formally, a clausal proof is a list of pairs s1, C1,...,sm, Cm, where for each i ∈ 1,...,m, s<sup>i</sup> ∈ {a, d} and C<sup>i</sup> is a clause. If s<sup>i</sup> = a, the pair is called an addition, and if s<sup>i</sup> = d, it is called a deletion. For a given input formula ϕ0, a clausal proof gives rise to a set of accumulated formulas ϕ<sup>i</sup> (i ∈ {1,...,m}) as follows:

$$\varphi\_i = \begin{cases} \varphi\_{i-1} \cup \{C\_i\} & \text{if } \mathbf{s}\_i = \mathbf{a} \\ \varphi\_{i-1} \setminus \{C\_i\} & \text{if } \mathbf{s}\_i = \mathbf{d} \end{cases}$$

Each clause addition must preserve satisfiability, which is usually guaranteed by requiring the added clauses to fulfill some efficiently decidable syntactic criterion. The main purpose of deletions is to speed up proof checking by keeping the accumulated formula small. A valid proof of unsatisfiability must end with the addition of the empty clause.
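The accumulated-formula semantics above can be sketched as a replay loop (the redundancy check of each addition, which is the real work of a proof checker, is deliberately elided in this sketch):

```python
def replay(phi0, steps):
    """Replay a clausal proof. steps is a list of ('a' | 'd', clause)
    pairs; clauses are tuples of signed ints. Returns the final
    accumulated formula. Redundancy checking is omitted here."""
    phi = set(phi0)
    for s, clause in steps:
        if s == 'a':
            phi.add(clause)
        else:  # s == 'd': deletions keep the accumulated formula small
            phi.discard(clause)
    return phi

# A schematic refutation: add a clause, delete one, end with the empty clause.
proof = [('a', (2,)), ('d', (1, 2)), ('a', ())]
final = replay({(1, 2), (-1, 2), (-2,)}, proof)
assert () in final  # a valid proof of unsatisfiability ends with ()
```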

## 3 Optimizations

Even with the best choice of parameters for our previous approach, solving the instance D12,13,12 takes almost two days of computation on a 128-core machine [23]. In order to prove Theorem 1, we need to solve an instance roughly 100 times harder, and thus several optimizations are needed. In fact, we improve on all aspects discussed in Section 2; we present five different forms of optimization that are key to the success of our approach, which we summarize next.


5. We introduce a new and extremely simple kind of clauses called alod clauses, which improve performance when added to the other clauses of any encoding we have tested.

The following subsections present each of these components in detail.

### 3.1 "Plus": a New Encoding

Despite the asymptotic improvement of the recursive encoding of Subercaseaux and Heule [23], its contribution is mostly of "theoretical interest" as it does not improve solution times. Nonetheless, that encoding suggests the possibility of finding one that is both more succinct than the direct encoding and speeds up computation. Our path towards such an encoding starts with Bounded Variable Addition (BVA) [16], a technique to automatically re-encode CNF formulas by adding new variables, with the goal of minimizing their resulting size (measured as the sum of the number of variables and the number of clauses). BVA can significantly reduce the size of Dr,k,c instances, even further than the recursive encoding. Moreover, BVA actually speeds up computation when solving the resulting instances with a CDCL solver, see Table 2. Figure 3 compares the number of amod clauses between the direct encoding and the BVA encoding; for example, in the direct encoding, color 10 in D14 requires roughly 30000 clauses, whereas it requires roughly 3500 in the BVA encoding. It can be observed as well in Figure 3 that the direct encoding grows in a very structured and predictable way, where color c in Dr requires roughly r<sup>2</sup>c<sup>2</sup> clauses. On the other hand, arguably because of its locally greedy nature, the results for BVA are far more erratic, and roughly follow a 4r<sup>2</sup> lg c curve.

The encoding resulting from BVA does not perform particularly well when coupled with the split algorithm of Subercaseaux and Heule. Indeed, Table 2 shows that while BVA heavily improves runtime under sequential CDCL, it does not provide a meaningful advantage when using Cube And Conquer. Furthermore, encodings resulting from BVA are hardly interpretable, as BVA uses

Fig. 3: Comparison of the size of the at-most-one-color clauses between the direct encoding and the BVA-encoding, for D<sup>4</sup> up to D<sup>14</sup> and colors {4,..., 10}.


Table 2: Comparison between the different encodings. Cube And Conquer experiments were performed with the approach of Subercaseaux and Heule [23].

a locally greedy strategy for introducing new variables. As a result, the design of a split algorithm that could work well with BVA is a very complicated task. Therefore, our approach consisted of reverse engineering what BVA was doing over some example instances, and using that insight to design a new encoding that produces instances of size comparable to those generated by BVA while being easily interpretable and thus compatible with natural split algorithms.

By manually inspecting BVA encodings one can deduce that a fundamental part of their structure is what we call regional variables/clauses. A regional variable rS,c is associated with a set of vertices S and a color c, meaning that at least one vertex in S receives color c. Let us illustrate their use with an example.

Example 2. Consider the instance D6,11, and let us focus on the at-most-one-distance (amod) clauses for color 4. Figure 4a depicts two regional clauses: one in orange (vertices labeled with α), and one in blue (vertices labeled with β), each consisting of 5 vertices organized in a plus (+) shape. We thus introduce variables rorange,4 and rblue,4, defined by the following clauses:


The benefit of introducing these two new variables and 2 + (5 · 2) = 12 additional clauses will be shown now, when using them to forbid conflicts more compactly. Indeed, each vertex labeled with α or β participates in |D4| − 1 = 40 amod clauses in the direct encoding, which amounts to a total of 10 · 40 − 45 = 355 clauses for all of them (subtracting the 45 clauses counted twice, one for each pair of labeled vertices). However, note that all 36 vertices shaded in light orange are at distance at most 4 from all vertices labeled with α, and thus they are in conflict with rorange,4. This means that we can encode all conflicts between α-vertices and orange-shaded vertices with 36 clauses. The same can be done for β-vertices and the 36 vertices shaded in light blue. Moreover, all pairs of vertices (x, y) with x being an α-vertex and y being a β-vertex are in conflict, which we can represent simply with the clause (rorange,4 ∨ rblue,4), instead of 5 · 5 = 25 pairwise clauses. We still need,

(a) Illustration of regions interacting in P6,11,6, for color 4.

(b) Illustration of the placement of the 13 regions in P6,11,6.

Fig. 4: Illustrations for P6,11,6.

however, to forbid that more than one α-vertex receives color 4, and the same for β-vertices, which can be done by simply adding all 2 · 10 = 20 amod clauses between the pairs within each region. In total, the number of clauses involving α- or β-vertices has gone down to 12 + 2 · 36 + 20 + 1 = 105 clauses, from the original 355 clauses, by merely adding two new variables.

As shown in Example 2, the use of regional clauses can make encodings more compact, and this same idea scales even better for larger instances when the regions are larger. A key challenge for designing a regional encoding in this manner is that it requires a choice of regions (which can even be different for every color). After trying several different strategies for defining regions, we found one that works particularly well in practice (despite not yielding an optimal number for the metric #variables + #clauses), which we call the plus encoding. The plus encoding is based on simply using "+" shaped regions (i.e., D1) for all colors greater than 3, and not introducing any changes for colors 1, 2 and 3, as they only amount to a very small fraction of the total size of the instances we consider. We denote with Pd,k,c the plus encoding of the diamond of size d with k colors, and the center being colored with c. Figure 4b illustrates P6,11,6. Interestingly, the BVA encoding opted for larger regions for the larger colors, using for example D2's or D3's as regions for color 14. We have experimentally found this to be very ineffective when coupled with our split algorithms. In terms of the locations of the "+" shaped regions, we have placed them manually through an interactive program, arriving at the conclusion that the best choice of locations consists of packing as many regions as possible and as densely around the center as possible. A more formal presentation of all the clauses involved in the plus encoding is given in the extended arXiv version [24] of this paper, but all its components have been illustrated in Example 2.

The exact number of clauses resulting from the plus encoding is hard to analyze precisely, but it is clear that asymptotically it only improves on the direct encoding by a constant multiplicative factor. Figure 3 and Table 2 illustrate the compactness of the plus encoding over particular instances, and its increase in efficiency both for CDCL solving as well as with the Cube And Conquer approach of Subercaseaux and Heule [23].

#### 3.2 Symmetry Breaking

Another improvement of our approach is a static symmetry-breaking technique, whereas Subercaseaux and Heule [23] achieved symmetry breaking by discarding all but <sup>1</sup>/8 of the cubes. We cannot do this easily since the plus encoding does not have an 8-fold symmetry; instead it has a 4-fold symmetry (see Figure 4b). We add symmetry-breaking clauses directly on top of the direct encoding (i.e., instead of using them after a Cube And Conquer split), as Dr,k,c does have an 8-fold symmetry (see Figure 5b). Concretely, if we consider a color t, it can only appear once in D⌊t/2⌋, as if it appeared more than once, said appearances would be at distance ≤ t. Given this, we can assume without loss of generality that if there is one appearance of t in D⌊t/2⌋, then it appears at coordinates (a, b) such that a ≥ 0 ∧ b ≥ a. We enforce this by adding negative unit clauses forbidding x(i,j),t for every pair (i, j) ∈ D⌊t/2⌋ such that i < 0 ∨ j < i. This is illustrated in Figure 5b for D5,10. Note however that this can only be applied to a single color t, as once a vertex in the north-north-east octant gets assigned color t, the 8-fold symmetry is broken. However, if the symmetry-breaking clauses have been added for color t, and yet t does not appear in D⌊t/2⌋, then there is still an 8-fold symmetry in the encoding that we can exploit by breaking symmetry on some other color t′. This way, our encoding uses L = 5 layers of symmetry breaking, for colors k, k − 1, ..., k − L + 1. At each layer i, where symmetry breaking is done over color k − i, except for the first (i.e., i > 0), we need to concatenate a clause

$$\text{SymmetryBroken}\_i := \bigvee\_{t=k-i}^k \bigvee\_{\substack{(a,b)\in D\_{\lfloor t/2 \rfloor} \\ 0\le a\le b}} x\_{(a,b),t}$$

to each symmetry-breaking clause, so that symmetry breaking is applied only when symmetry has not been broken already. Table 3 illustrates the impact of this symmetry-breaking approach, yielding close to a ×40 speed-up for D6,11,6.
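The unit clauses added for a single color t can be enumerated directly; a sketch with our own naming (the actual generator is part of the authors' tooling):

```python
def symmetry_breaking_units(t):
    """Positions (i, j) in D_{floor(t/2)} where color t is forbidden,
    i.e., those with i < 0 or j < i. Each position yields one negative
    unit clause, forcing color t into the octant 0 <= a <= b."""
    r = t // 2
    disk = [(i, j) for i in range(-r, r + 1) for j in range(-r, r + 1)
            if abs(i) + abs(j) <= r]
    return [(i, j) for (i, j) in disk if i < 0 or j < i]

# For t = 10: |D_5| = 61 positions, of which 12 satisfy 0 <= a <= b,
# so 49 negative unit clauses are added.
assert len(symmetry_breaking_units(10)) == 49
```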

#### 3.3 At-Least-One-Distance clauses

Yet another addition to our encoding is what we call At-Least-One-Distance (alod) clauses, which consist of stating that, for every vertex v, at least one vertex in D1(v) must get color 1. Concretely, the At-Least-One-Distance clause corresponding to a vertex v = (i, j) is

$$C\_v = x\_{(i,j),1} \lor x\_{(i+1,j),1} \lor x\_{(i-1,j),1} \lor x\_{(i,j+1),1} \lor x\_{(i,j-1),1}$$

(a) Illustration of the effect of adding alod clauses. The right figure, with alod clauses, presents a chessboard pattern.

Fig. 5: The effect of adding alod clauses (left) and symmetry-breaking (right).

Note that adding these clauses preserves satisfiability since they are blocked clauses [15]; this can be seen as follows. If no vertex in D1(v) gets assigned color 1, then we can simply assign xv,1, thus satisfying the new clause Cv.

The purpose of alod clauses can be described as incentives towards assigning color 1 in a chessboard pattern (see Figure 5a), which seems to simplify the rest of the computation. Empirically, their addition improves runtimes; see Table 3.
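The alod clauses and the chessboard pattern they incentivize can be sketched as follows (helper names are ours):

```python
def alod_clauses(r):
    """For each vertex v of D_r, the clause requiring that some vertex
    of D_1(v) (clipped to D_r) gets color 1. In this sketch a clause is
    a list of (vertex, color) pairs."""
    disk = {(i, j) for i in range(-r, r + 1) for j in range(-r, r + 1)
            if abs(i) + abs(j) <= r}
    return [[(u, 1) for u in ((i, j), (i + 1, j), (i - 1, j),
                              (i, j + 1), (i, j - 1)) if u in disk]
            for (i, j) in disk]

# Assigning color 1 on one parity class (a chessboard) satisfies every
# alod clause: each clipped D_1(v) contains a vertex of either parity.
chessboard = lambda v: (v[0] + v[1]) % 2 == 0
assert all(any(chessboard(u) for (u, _) in cl) for cl in alod_clauses(6))
```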

#### 3.4 Cube And Conquer Using Auxiliary Variables

The split of Subercaseaux and Heule [23] is based on cases about the xv,c variables of the direct encoding, and specifically using vertices v that are close to the center and colors c that are in the top-t colors for some parameter t.

Our algorithm is instead based on cases only around the new regional variables rS,c, which appears to be key for exploiting their use in the encoding.

More concretely, our algorithm, which we call ptr, is roughly based on splitting the instance into cases according to which out of the R regions that are closest to the center get which of the T highest colors (noting that a region can get multiple colors). A third parameter P indicates the maximum number of positive literals in any cube of the split. More precisely, there are cubes with i positive literals for i ∈ {0, 1,...,P − 1, P}, and the set of cubes with i positive literals is constructed by ptr as follows:


Lemma 1. The cubes generated by the ptr algorithm form a tautology.

The proof of Lemma 1 is quite simple, and we refer the reader to the proof of Lemma 7 in Subercaseaux and Heule [23] for a very similar one. Moreover, because our goal is to have a verifiable proof, instead of relying on Lemma 1, we test explicitly that the cubes generated by our algorithm form a tautology in all the instances mentioned in this paper. Pseudo-code for ptr is presented in the extended arXiv version of this paper [24].
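The explicit tautology test over generated cubes can be sketched by brute force (feasible only for small variable sets; the production check presumably exploits the tree structure of the split):

```python
from itertools import product

def is_tautological_dnf(cubes):
    """Check that the disjunction of the cubes (lists of signed ints)
    is satisfied by every assignment of the mentioned variables."""
    variables = sorted({abs(l) for cube in cubes for l in cube})
    for bits in product([False, True], repeat=len(variables)):
        value = dict(zip(variables, bits))
        if not any(all(value[abs(l)] ^ (l < 0) for l in cube)
                   for cube in cubes):
            return False  # found an assignment no cube covers
    return True

assert is_tautological_dnf([[1], [-1, 2], [-1, -2]])
assert not is_tautological_dnf([[1], [-1, 2]])  # misses x1=F, x2=F
```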

#### 3.5 Optimizing the Center Color

Our previous work [23] argued that for an instance Dr,k, one should fix the color of the central vertex to min(r, k). However, our experiments suggest otherwise. As the proof of Lemma 2 (in the extended arXiv version [24]) implies, we are allowed to fix any color in the center, and as long as the resulting instance is unsatisfiable, that will allow us to establish the same lower bound. It turns out that the choice of the center color can dramatically affect performance, as shown for the instance D12,13 (the one used to prove χρ(Z<sup>2</sup>) ≥ 14) in Figure 6. Interestingly, performance does not change monotonically with the value fixed in the center. Intuitively, it appears that fixing small colors in the center is ineffective as they impose restrictions only on a small region around the center, while fixing very large colors in the center does not constrain the center much either; for example, on the one hand, fixing a 1 or 2 in the center does not seem to impose any serious constraints on solutions. On the other hand, when a 12 is fixed in the center (as in our previous work [23]), color 6 can be used 5 times in D6, whereas if color 6 is fixed in the center, it can only be used once in D6. The apparent advantage of fixing 12 in the center (that it cannot occur anywhere else in D12,13) is outweighed by the extra constraints around the center that fixing color 6 imposes; Subercaseaux and Heule already observed that most conflicts between colors occur around the center [23], which explains why it makes sense to optimize in that area.

The main result of Subercaseaux and Heule [23] is the unsatisfiability of D12,13,12, which required 45 CPU hours using the same SAT solver and similar hardware. Let P⋆d,k,c denote Pd,k,c with alod clauses and symmetry-breaking

Fig. 6: The impact of the color in the center (c) on the performance for P⋆12,13,c.


$$D\_{15,14,6} \;\overset{\text{symmetry proof}}{\equiv}\; D^{\star}\_{15,14,6} \;\overset{\text{re-encoding proof}}{\equiv}\; P^{\star}\_{15,14,6} \;\overset{\text{implication proof}}{\vdash}\; \overline{c\_1} \wedge \cdots \wedge \overline{c\_m} \;\overset{\text{tautology proof}}{\vdash}\; \bot$$

Fig. 7: Illustration of the verification pipeline.

predicates. We show unsatisfiability of P⋆12,13,12 in 1.18 CPU hours and of P⋆12,13,6 in 0.34 CPU hours. So the combination of the plus encoding and the improved center color reduces the computational cost by two orders of magnitude.

## 4 Verification

Our pipeline proves that, in order to trust χρ(Z<sup>2</sup>) = 15 as a result, the only component that requires unverified trust is the direct encoding of D15,14,6. Indeed, let P⋆15,14,6 be the instance P15,14,6 with alod clauses and 5 layers of symmetry-breaking clauses, and let ψ = {c1,...,cm} be the set of cubes generated by the ptr algorithm with parameters P = 6, T = 7, R = 9. We then prove:


As a result, Theorem 1 relies only on our implementation of D15,14,6. Fortunately, this is quite simple, and the whole implementation is presented in the extended arXiv version of this paper [24]. Figure 7 illustrates the verification pipeline, and the following paragraphs detail its different components.

Symmetry Proof. The first part of the proof consists in the addition of symmetry-breaking predicates to the formula. This part needs to go before the re-encoding proof, because the plus encoding does not have the 8-fold symmetry of the direct encoding. Each of the clauses in the symmetry-breaking predicates has the substitution redundancy (SR) property [5]. This is a very strong redundancy property, and checking whether a clause C has SR w.r.t. a formula ϕ is NP-complete. However, since we know the symmetry, it is easy to compute an SR certificate. There exists no SR proof checker. Instead, we implemented a prototype tool to convert SR proofs into DRAT, for which formally verified checkers exist. Our conversion is similar to the approach used to convert propagation redundancy into DRAT [12]. The conversion can significantly increase the size of the proof, but the other proof parts are typically larger for harder formulas, thus the size is acceptable.

Re-encoding Proof. After symmetry breaking, the formula encoding is optimized by transforming the direct encoding into the plus encoding and adding the alod clauses. This part of the proof is easy. All clauses in the plus encoding and all alod clauses have the RAT redundancy property w.r.t. the direct encoding. This means that we can add all these clauses with a single addition step per clause. Afterward, the clauses that occur in the direct encoding but not in the plus encoding are removed using deletion steps.

Implication Proof. The third part of the proof expresses that the formula cannot be satisfied with any of the cubes from the split. For easy problems, one can avoid splitting and just use the empty cube as the tautological DNF. For harder problems, splitting is crucial. We solve D15,14,6 using a split with just over 5 million cubes. Using a SAT solver to show that the formula conjoined with a cube is unsatisfiable shows that the negation of the cube is implied by the formula. We can derive all these implied clauses in parallel. The proofs of unsatisfiability can be merged into a single implication proof.

Tautology Proof. The final proof part needs to show that the negation of the clauses derived in the prior steps form a tautology. In most cases, including ours, the cubes are constructed using a tree-based method. This makes the tautology check easy as there exists a resolution proof from the derived clauses to the empty clause using m−1 resolution steps with m denoting the number of cubes. This part can be generated using a simple SAT call.

The final proof merges all the proof parts. In case the proof parts are all in the DRAT format, such as our proof parts, then they can simply be merged by concatenating the proofs using the order presented above.

## 5 Experiments

Experimental Setup. In terms of hardware, all our experiments were run on the Bridges2 [4] supercomputer. Each node has the following specifications: two AMD EPYC 7742 CPUs, each with 64 cores, 256 MB of L3 cache, and 512 GB of RAM. Our code and various formulas are publicly available at the repository https://github.com/bsubercaseaux/PackingChromaticTacas. In terms of software, all sequential experiments were run with the state-of-the-art solver CaDiCaL [2], while parallel experiments with Cube And Conquer were run using a new parallel implementation of iCaDiCaL, as it supports incremental solving [13] while being significantly faster than iLingeling.

Effectiveness of the Optimizations. We evaluated the optimizations to the direct encoding as proposed in Section 3: the plus encoding, the addition of the alod clauses, and the new symmetry breaking. The results are shown in Table 3. We picked D6,11,<sup>6</sup> for this evaluation since it is the largest diamond that can still be solved within a couple of hours on a single core.

The main conclusion is that the optimizations significantly improve the runtime. A comparison between the direct encoding without symmetry breaking and the plus encoding with symmetry breaking and the alod clauses shows that the latter can be solved roughly 200x faster. Table 3 shows all 8 possible configurations. Turning on any of the optimizations always improves performance. The effectiveness of the plus encoding and alod clauses is somewhat surprising: the speed-up factor obtained by re-encoding typically does not exceed the factor by which the formula size is reduced. In this case, the reduction factor in formula size is less than 3, while the speed-up is larger than 13 (see the difference between the first and second row of Table 3). Moreover, we are not aware of prior cases in which adding blocked clauses improves performance: typically, SAT solvers remove them.

We also constructed DRAT proofs of the optimizations (shown as derivation in the table) and the solver runtime. We merged them into a single DRAT proof by concatenating the files. The proofs were first checked with the drat-trim tool, which produced LRAT proofs. These LRAT files were validated using the formally verified cake_lpr checker. The size of the DRAT proofs and the checking time are shown in the table. Note that the checking time for the proofs with symmetry breaking is always larger than the solving time. This is caused by expressing the symmetry breaking in DRAT, resulting in a 436 MB proof part.

The Implication Proof. The largest part of the computation consists of showing that P⋆15,14,6 is unsatisfiable under each of the 5,217,031 cubes produced by the cube generator. The results of the experiments are shown in Figure 8 (left): roughly half of the cubes can be solved in a second or less. The average runtime per cube was 3.35 seconds, while the hardest cube required 1584.61 seconds. The total runtime was 4851.38 CPU hours.

For each cube, we produced a compressed DRAT proof (the default output of CaDiCaL). Due to the lack of hints in DRAT proofs, they are somewhat costly to validate using a formally verified checker. Instead, we first use the tool drat-trim to trim the proofs and add hints. The results are uncompressed LRAT files, which we validate using the formally verified checker cake_lpr. The verification time was 4336.93 CPU hours, slightly less than the total solving time.

The sizes of each of the implication proofs show a similar distribution, as depicted in Figure 8 (right). Most proofs are less than 10 MB in size. The


Table 3: Evaluating the effectiveness of the optimizations on D6,11,6.

Fig. 8: Cactus plot of solving and verification times in seconds (left) and cactus plot of the sizes of the compressed DRAT proofs and uncompressed LRAT proofs in MB (right).

compressed DRAT proofs are generally smaller compared to the LRAT proofs, but that is mostly due to compression, which reduces the size by around 70%.

The Chessboard Conjecture and its Counterexample. Color 1 can be used to fill half of Z<sup>2</sup> in a packing coloring, and the packing colorings found in the past with 15, 16, or 17 colors all use color 1 with density 1/2 in a chessboard pattern [18], so it is tempting to assume that this must always be the case. We thus conjectured that any instance Dr,k,c is satisfiable if and only if it is satisfiable with the chessboard pattern. The consequences of this conjecture would be significant: if it were true, we could fix half of the vertices to color 1, massively reducing the size of the instance and its runtime. Unfortunately, the conjecture turns out to be false, the smallest counterexample being D14,14,6, illustrated in Figure 9, which deviates from the chessboard pattern in only 2 vertices. We have also proved that no solution for D14,14,6 deviating in only 1 vertex from the chessboard pattern exists.
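The density claim can be made concrete: placing color 1 on all cells with even coordinate sum is a valid packing (any two such cells are at L1 distance at least 2) and covers exactly half of a finite grid. A minimal pure-Python sketch (the function names are ours, for illustration only):

```python
# Sketch: verify that the chessboard placement of color 1 is a valid
# packing (pairwise L1 distance > 1) of density 1/2 on an n x n grid.
from itertools import product

def chessboard_ones(n):
    """Cells of an n x n grid receiving color 1 in the chessboard pattern."""
    return [(x, y) for x, y in product(range(n), repeat=2) if (x + y) % 2 == 0]

def is_valid_packing(cells, d=1):
    """Color d requires pairwise L1 distance greater than d between its cells."""
    return all(abs(ax - bx) + abs(ay - by) > d
               for i, (ax, ay) in enumerate(cells)
               for (bx, by) in cells[i + 1:])

ones = chessboard_ones(6)
assert is_valid_packing(ones)      # no two 1-cells are at L1 distance <= 1
assert len(ones) == 6 * 6 // 2     # density exactly 1/2
```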

Proving the Lower Bound. In order to prove Theorem 1, we require the following 3 lemmas, from which the conclusion easily follows.

Lemma 2. If D15,14,6 is unsatisfiable, then χρ(Z<sup>2</sup>) ≥ 15.

Lemma 3. If D15,14,6 is satisfiable, then P15,14,6 is also satisfiable.

Lemma 4. P15,14,6 is unsatisfiable.

We have obtained computational proofs of Lemma 3 and Lemma 4 as described above, and thus it only remains to prove Lemma 2, which we include in the appendix. We can thus proceed to our main proof.

Proof (of Theorem 1). Since Martin et al. proved that χρ(Z<sup>2</sup>) ≤ 15 [18], it remains to show χρ(Z<sup>2</sup>) ≥ 15, which by Lemma 2 reduces to proving Lemma 3 and Lemma 4. We have proved these lemmas computationally, obtaining a single DRAT proof as described in Section 4. The total solving time was 4851.31 CPU hours, while the total checking time of the proofs was 4336.93 CPU hours. The total size of the compressed DRAT proof is 34 terabytes, while the uncompressed LRAT proof is 122 terabytes.

Fig. 9: A valid coloring of D14,14,6. No valid coloring exists for this grid with a full chessboard pattern of 1's.

## 6 Concluding Remarks and Future Work

We have proved χρ(Z<sup>2</sup>) = 15 using several SAT-solving techniques, in what constitutes a new success story for automated-reasoning tools applied to combinatorial problems. Moreover, we believe that several of our contributions in this work may be applicable to other settings and problems. Indeed, we obtained a better encoding by reverse engineering BVA, and designed a split algorithm that works well coupled with the new encoding; this experience suggests the compatibility between split and encoding as a new key variable to pay attention to when solving combinatorial problems under the cube-and-conquer paradigm. As future work, it is natural to study whether our techniques can be used to improve other known bounds in the packing-coloring area (see e.g., [3]), and whether they extend to other families of coloring problems, such as distance colorings [14].

Acknowledgements We thank the Pittsburgh Supercomputing Center for allowing us to use Bridges2 [4] in our experiments. We also thank the anonymous reviewers for their comments and suggestions, and Donald Knuth for his thorough comments and suggestions. The first author thanks the Facebook group "actually good math problems", where he first learned about this problem, and in particular Dylan Pizzo for his post about it.

## References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## Active Learning for SAT Solver Benchmarking

Tobias Fuchs, Jakob Bach, and Markus Iser

Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany info@tobiasfuchs.de, {jakob.bach,markus.iser}@kit.edu

Abstract. Benchmarking is a crucial phase when developing algorithms. This also applies to solvers for the SAT (propositional satisfiability) problem. Benchmark selection is about choosing representative problem instances that reliably discriminate solvers based on their runtime. In this paper, we present a dynamic benchmark selection approach based on active learning. Our approach predicts the rank of a new solver among its competitors with minimum runtime and maximum rank prediction accuracy. We evaluated this approach on the Anniversary Track dataset from the 2022 SAT Competition. Our selection approach can predict the rank of a new solver after about 10 % of the time it would take to run the solver on all instances of this dataset, with a prediction accuracy of about 92 %. We also discuss the importance of instance families in the selection process. Overall, our tool provides a reliable way for solver engineers to determine a new solver's performance efficiently.

Keywords: Propositional satisfiability · Benchmarking · Active learning

## 1 Introduction

One of the main phases of algorithm engineering is benchmarking. This also applies to propositional satisfiability (SAT), the archetypal NP-complete problem. Benchmarking is, however, quite expensive regarding the runtime of experiments. While benchmarking a single SAT solver might still be feasible, developing new, competitive SAT solvers requires extensive experimentation with a variety of ideas [8,2]. In particular, a new solver idea is rarely best on the first try. Thus, it is highly desirable to reduce benchmarking time and discard unpromising ideas early, allowing one to test more approaches or spend more time on promising ones. The field of SAT solver benchmarking is well established, but traditional benchmark selection approaches do not optimize benchmark runtime. Instead, they focus on selecting a representative set of instances for scoring solvers [10,15]. For the latter, SAT Competitions typically employ the PAR-2 score, i.e., the average runtime with a penalty of 2τ for timeouts with time limit τ [8].

In this paper, we present a novel benchmark selection approach based on active learning. Our approach can predict the rank of a new solver with high accuracy in only a fraction of the time needed to evaluate the complete benchmark. Definition 1 specifies the problem we address.

© The Author(s) 2023. S. Sankaranarayanan and N. Sharygina (Eds.): TACAS 2023, LNCS 13993, pp. 407–425, 2023. https://doi.org/10.1007/978-3-031-30823-9\_21

Definition 1 (New-Solver Problem). Given solvers A, instances I, runtimes r : A×I → [0, τ] with time limit τ, and a new solver â ∉ A, incrementally select benchmark instances from I to maximize the confidence in predicting the rank of â while minimizing the total benchmark runtime.

Note that our scenario assumes knowing the runtimes of all solvers, except the new one, on all instances. One could also imagine a collaborative filtering scenario, where runtimes are only partially known [23,25].

Our approach satisfies several desirable criteria for benchmarking. Rather than outputting a binary classification, i.e., whether the new solver is worse than an existing solver or not, we provide a scoring function that shows by which margin a solver is worse and how similar it is to existing solvers. In particular, our approach enables ranking the new solver amidst a set of existing solvers. For this ranking, we do not even need to predict exact solver runtimes, which is harder. Further, we optimize the runtime that our strategy needs to arrive at its conclusion. We use instance and runtime features. Moreover, we select instances non-randomly and incrementally; in particular, we consider runtime information from already completed experiments when choosing the next one. By doing so, we can control the properties of the benchmarking approach, such as its required runtime. Our approach is scalable in that it ranks a new solver â among any number of known solvers A. In particular, we subsample the benchmark only once instead of comparing pairwise against each other solver [21].

We evaluate our approach with the SAT Competition 2022 Anniversary Track dataset [2], consisting of 5355 instances and runtimes of 28 solvers. We perform cross-validation by treating each solver once as the new solver and learning to predict the PAR-2 rank of that solver. On average, our predictions reach about 92 % accuracy with only about 10 % of the runtime required to evaluate these solvers on the complete set of instances.

Our entire source code<sup>1</sup> and experimental data<sup>2</sup> are available on GitHub.

## 2 Related Work

Benchmarking is not only of high interest in many fields but also an active research area on its own. Recent studies show that benchmark selection is challenging for multiple reasons. Biased benchmarks can easily lead to fallacious interpretations [7]. Benchmarking also has many interchangeable parts, such as the performance measures used, how measurement points are aggregated, and how missing values are handled. Questionable research practices could alter these elements a-posteriori to meet expectations, thereby skewing the results [27]. In the following, we discuss related work from the areas of static benchmark selection, algorithm configuration, incremental benchmark selection, and active learning. Table 1 compares the most relevant approaches, which all pursue slightly different goals. Thus, our approach is not a general improvement over the others but the only one fully aligned with Definition 1.

<sup>1</sup> https://github.com/mathefuchs/al-for-sat-solver-benchmarking

<sup>2</sup> https://github.com/mathefuchs/al-for-sat-solver-benchmarking-data


Table 1: Comparison of features of our benchmark-selection approach, the static benchmark-selection approach by Hoos et al. [15], the algorithm configuration system SMAC [16], and the active-learning approaches by Matricon et al. [21].

Static Benchmark Selection. Benchmark selection is essential for competitions, e.g., the SAT Competition. In such competitions, the organizers define the rules for composing the benchmarks. These selection strategies are primarily static, i.e., they do not depend on particular solvers to distinguish. Balint et al. provide an overview of benchmark-selection criteria in different solver competitions [1]. Froleyks et al. describe benchmark selection in recent SAT competitions [8]. Manthey and Möhle find that competition benchmarks might contain redundant instances and propose a feature-based approach to remove redundancy [20]. Mısır presents a feature-based approach to reduce benchmarks by matrix factorization and clustering [24].

Hoos et al. [15] discuss which properties are most desirable when selecting SAT benchmark instances. The selection criteria are instance variety to avoid over-fitting, suitable instance hardness (not too easy but also not too hard), and avoiding duplicate instances. To filter out overly similar instances, they use a distance-based approach with the SATzilla features [37,38]. The approach does not, however, optimize for benchmark runtime, and it selects instances randomly, apart from constraints on instance hardness and feature distance.

Algorithm Configuration. Further related work can be found within the field of algorithm configuration [14,32], e.g., the configuration system SMAC [16]. Thereby, the goal is to tune SAT solvers for a given sub-domain of problem instances. Although this task is different from our goal, e.g., we do not need to navigate the configuration space, there are similarities to our approach as well. For example, SMAC also employs an iterative, model-based selection procedure, though for configurations rather than instances. An algorithm configurator, however, cannot be used to rank or score a new solver since algorithm configuration solely seeks to find the best-performing configuration. Also, while SMAC uses a model-based selection strategy to sample configurations, it selects instances randomly, i.e., without building a model over instances.

Incremental Benchmark Selection. Matricon et al. present an incremental benchmark selection approach [21]. Their per-set efficient algorithm selection problem (PSEAS) is similar to our New-Solver Problem (cf. Definition 1). Given a pair of SAT solvers, they iteratively select a subset of instances until the

Fig. 1: Types of machine learning (depiction inspired by Rubens et al. [29]).

desired confidence level is reached to decide which of the two solvers is better. The selection of instances depends on the choice of the solvers to distinguish. They calculate a scoring metric for all unselected instances, run the experiment with the highest score, and update the confidence. Their approach ticks off most of our desired features in Table 1. However, the approach only compares solvers in a binary fashion rather than providing a scoring. Thus, it is unclear how similar two given solvers are or on which instances they behave similarly. Moreover, a significant shortcoming is the lack of scalability with the number of solvers: since only pairs of solvers are compared, evaluating a new solver requires sampling a separate benchmark for each existing solver. In contrast, our approach allows comparing a new solver against a set of existing solvers by sampling only one benchmark.

Active Learning. Prediction models in passive machine learning are trained on datasets with given instance labels (cf. Fig. 1a). In contrast, active learning (AL) starts with no or little labeled data. It repeatedly selects interesting problem instances for which to acquire labels, aiming to gradually improve the prediction model (cf. Fig. 1b). AL methods are especially beneficial if acquiring labels is computationally expensive, as is obtaining solver runtimes. Without AL methods, it is not obvious which instances to label and which to skip. On the one hand, we want to maximize the utility an instance provides to our model, i.e., rank-prediction accuracy; on the other hand, we want to minimize the cost associated with acquiring the instance's label, i.e., its predicted runtime. Thus, we strive for an accurate prediction model without having to label every data point.

Rubens et al. [29] survey active-learning advances. While synthesis-based AL methods [5,9,34] generate instances for labeling, pool-based methods [11,13,19] rely on a fixed set of unlabeled instances to sample from. Recent synthesis-based methods within the field of SAT solving show how to generate problem instances with desired properties [5,9]. This goal is, however, orthogonal to ours. While those approaches want to generate instances on which a solver is good or bad, we want to predict whether a solver is good or bad on an existing benchmark. Volpato and Guangyan use pool-based AL to learn an instance-specific algorithm selector [35]. Rather than benchmarking a solver's overall performance, their goal is to recommend the best solver out of a set of solvers for each SAT instance.


## 3 Active Learning for SAT Solver Benchmarking

Algorithm 1 outlines our benchmarking framework. Given a set of solvers A, instances I, and runtimes r, we first initialize a prediction model M for the new solver â (Line 1). The prediction model M is used to repeatedly select an instance (Line 4) for benchmarking â (Line 5). The acquired result is subsequently used to update the prediction model M (Line 7). When the stopping criterion is met (Line 3), we quit the benchmarking loop and predict the final score of â (Line 8). Algorithm 1 returns the predicted score of â as well as the acquired instances and runtime measurements (Line 9).

Section 3.1 describes the underlying prediction model M and specifies how we may derive a solver ranking from it. We discuss criteria for selecting instances in Section 3.2. Section 3.3 concludes with possible stopping conditions.

#### 3.1 Solver Model

The model M provides a runtime-label prediction function f : Â×I → R for all solvers Â := A ∪ {â}. This prediction function powers instance selection as described in Section 3.2. During model updates (Algorithm 1, Line 7), f is trained to predict a transformed version of the acquired runtimes R. We describe the runtime transformation in the subsequent section. The features described in Section 4.2 serve as the input to the model. Further, note that we build a new prediction model in each iteration since running experiments (Line 5) dominates the runtime of model training by orders of magnitude. Finally, we predict the score of the new solver â with the prediction function f (Line 8).

Runtime Transformation. For the prediction model M, we transform the real-valued runtimes into discrete runtime labels on a per-instance basis. For each instance e ∈ I, we use a clustering algorithm to assign the runtimes in {r(a, e) | a ∈ A} to one of k clusters C1, ..., Ck such that the fastest runtimes for the instance e are in cluster C1 and the slowest non-timeout runtimes are in cluster Ck−1. Timeouts τ always form a separate cluster Ck. The runtime transformation function γk : A×I → {1, ..., k} is then specified as follows:

$$
\gamma\_k(a, e) = j \iff r(a, e) \in C\_j,
$$

Given an instance e ∈ I, a solver a ∈ A thus belongs to the γk(a, e)-th fastest group of solvers on instance e. In preliminary experiments, we achieved higher accuracy when predicting such discrete runtime labels than when predicting raw runtimes. Research on portfolio solvers has also shown that discretization works well in practice [4,26].
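As a simplified illustration of the transformation γk (the clustering actually used is described in Section 4.3), the following sketch bins the non-timeout runtimes of all solvers on one instance into k−1 rank-based clusters and reserves label k for timeouts. Solver names and runtimes are hypothetical:

```python
# Simplified sketch of the runtime-label transformation: non-timeout
# runtimes on one instance are binned into k-1 rank-based clusters
# (label 1 = fastest), and timeouts get the separate label k.
def discretize(runtimes, tau, k=3):
    """Map {solver: runtime} to {solver: label in 1..k} for one instance."""
    finished = sorted(t for t in runtimes.values() if t < tau)
    def label(t):
        if t >= tau:                       # timeout -> own cluster C_k
            return k
        rank = finished.index(t)           # rank-based bin among finished runs
        return 1 + (rank * (k - 1)) // max(len(finished), 1)
    return {a: label(t) for a, t in runtimes.items()}

labels = discretize({"s1": 0.5, "s2": 1.0, "s3": 90.0, "s4": 5000.0}, tau=5000)
# labels == {"s1": 1, "s2": 1, "s3": 2, "s4": 3}
```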

Ranking Solvers. To determine solver ranks, we use the transformed runtimes γk(a, e) in the adapted scoring function sk : A → [1, 2k] as follows:

$$s\_k(a) := \frac{1}{|\mathcal{I}|} \sum\_{e \in \mathcal{I}} \gamma\_k'(a, e) \qquad \gamma\_k'(a, e) := \begin{cases} 2 \cdot \gamma\_k(a, e) & \text{if } \gamma\_k(a, e) = k \\ \gamma\_k(a, e) & \text{otherwise} \end{cases} \tag{1}$$

I.e., we apply PAR-2 scoring, which is commonly used in SAT competitions [8], to the discrete labels. The scoring function sk induces a ranking among solvers.
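The doubling rule for timeouts in Equation 1 can be sketched in a few lines (illustrative only; the label lists are hypothetical):

```python
# Sketch of the adapted scoring s_k: average the discrete labels of one
# solver over all instances, doubling the timeout label k (PAR-2 style).
def score(labels_per_instance, k=3):
    """labels_per_instance: list of labels gamma_k(a, e) for one solver a."""
    adjusted = [2 * g if g == k else g for g in labels_per_instance]
    return sum(adjusted) / len(adjusted)

# Lower scores rank better; a solver that always times out scores 2k.
assert score([3, 3, 3], k=3) == 6.0
assert score([1, 2, 3], k=3) == 3.0   # (1 + 2 + 6) / 3
```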

#### 3.2 Instance Selection

Selecting an instance based on the model is a core functionality of our framework (cf. Algorithm 1, Line 4). In this section, we introduce two instance-sampling strategies, one that minimizes uncertainty and one that maximizes information gain. Both strategies use the model's label-prediction function f and are inspired by existing work in active learning [30]. These methods require the model's predictions to include probabilities for the k discrete runtime labels; let f′ : Â×I → [0, 1]<sup>k</sup> denote this modified prediction function. In the following, Ĩ ⊆ I denotes the set of instances that have already been sampled.

Uncertainty Sampling. The uncertainty sampling strategy selects the instance closest to the model's decision boundary, i.e., we select the instance e ∈ I \ Ĩ that minimizes U(e), which is specified as follows:

$$\mathcal{U}(e) := \left| \frac{1}{k} - \max\_{n \in \{1, \dots, k\}} f'(\hat{a}, e)\_n \right|.$$
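A sketch of this criterion, with hypothetical instance names and probability vectors for k = 3 labels:

```python
# Sketch of uncertainty sampling: pick the unsampled instance whose
# maximum predicted label probability is closest to the uniform value 1/k.
def uncertainty(probs, k):
    return abs(1.0 / k - max(probs))

def select_uncertain(pred, sampled, k=3):
    """pred: {instance: probability vector over the k labels}."""
    candidates = {e: p for e, p in pred.items() if e not in sampled}
    return min(candidates, key=lambda e: uncertainty(candidates[e], k))

pred = {"e1": [0.9, 0.05, 0.05],   # confident prediction -> large U(e)
        "e2": [0.4, 0.3, 0.3]}     # near-uniform -> small U(e), most uncertain
assert select_uncertain(pred, sampled=set()) == "e2"
```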

Information-Gain Sampling. The information-gain sampling strategy selects the instance with the highest expected entropy reduction regarding the runtime labels of the instance. More specifically, we select the instance e ∈ I \ Ĩ that maximizes IG(e), which is specified as follows:

$$\text{IG}(e) := \text{H}(e) - \sum\_{n=1}^{k} f'(\hat{a}, e)\_n \hat{\text{H}}\_n(e)$$

Here, H(e) denotes the entropy of the runtime labels γk(a, e) over all a ∈ A, and Ĥn(e) denotes the entropy of these labels with n added as the runtime label for â; the term Ĥn(e) is computed for every possible runtime label n ∈ {1, ..., k}. By maximizing information gain, we select instances that help identify solvers with similar behavior.
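The computation of IG(e) can be sketched as follows (label data and probability vectors are invented for illustration; entropies in bits):

```python
# Sketch of information-gain sampling: expected entropy change of an
# instance's label distribution when the new solver's label is added.
import math

def entropy(labels):
    n = len(labels)
    counts = {l: labels.count(l) for l in set(labels)}
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def info_gain(known_labels, probs):
    """known_labels: labels gamma_k(a, e) of existing solvers on instance e;
    probs: predicted label distribution f'(a_hat, e) over labels 1..k."""
    expected = sum(p * entropy(known_labels + [n + 1])
                   for n, p in enumerate(probs))
    return entropy(known_labels) - expected

# An instance on which all known solvers agree contributes less information
# than one with mixed labels.
ig_uniform = info_gain([1, 1, 1, 1], [1/3, 1/3, 1/3])
ig_mixed = info_gain([1, 2, 3, 3], [1/3, 1/3, 1/3])
assert ig_mixed > ig_uniform
```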

#### 3.3 Stopping Criteria

In this section, we present the two dynamic stopping criteria in our experiments, the Wilcoxon and the ranking stopping criterion (cf. Algorithm 1, Line 3).

Wilcoxon Stopping Criterion. The Wilcoxon stopping criterion stops the active-learning process once we are confident that the predicted runtime labels of the new solver are sufficiently different from those of existing solvers. This criterion is loosely inspired by Matricon et al. [21]. We use the average p-value Wâ of a Wilcoxon signed-rank test w(S, P) of the two runtime-label distributions S = {γk(a, e) | e ∈ I} for an existing solver a and P = {f(â, e) | e ∈ I} for the new solver â:

$$W\_{\hat{a}} := \frac{1}{|\mathcal{A}|} \sum\_{a \in \mathcal{A}} \mathbf{w}(S, P)$$

To improve the stability of this criterion, we use an exponential moving average to smooth out outliers and stop as soon as W_exp^(i) drops below a fixed threshold:

$$\begin{aligned} W\_{\text{exp}}^{(0)} &:= 1\\ W\_{\text{exp}}^{(i)} &:= \beta W\_{\hat{a}} + (1 - \beta) \, W\_{\text{exp}}^{(i - 1)} \end{aligned}$$
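The smoothing and stopping logic can be sketched as follows (the p-value sequences and the threshold below are illustrative, not values from our experiments):

```python
# Sketch of the Wilcoxon stopping rule's smoothing: an exponential moving
# average of the per-iteration average p-values, stopping once it drops
# below a fixed threshold.
def should_stop(p_values, beta=0.1, threshold=0.05):
    w_exp = 1.0                          # W_exp^(0) := 1
    for i, w in enumerate(p_values, 1):
        w_exp = beta * w + (1 - beta) * w_exp
        if w_exp < threshold:
            return i                     # stop after iteration i
    return None                          # criterion never triggered

assert should_stop([0.5, 0.4, 0.3]) is None        # p-values still too high
assert should_stop([0.001] * 40, beta=0.7) == 3    # EMA decays quickly
```

A large β reacts quickly to new p-values, while a small β weights the history more heavily and thus smooths more aggressively.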

Ranking Stopping Criterion. The ranking stopping criterion is less sophisticated in comparison. It stops the active-learning process if the ranking induced by the model's predictions (Equation 1) has remained unchanged within the last l iterations. The concrete values of the predicted score sk(â) might still change, however; we are solely interested in the induced ranking in this case.

## 4 Experimental Design

Given all the previously presented instantiations for Algorithm 1, this section outlines our experimental design, including our evaluation framework, the datasets used, hyper-parameter choices, and implementation details.

#### 4.1 Evaluation Framework

As stated in the Introduction, this work addresses the New-Solver Problem (cf. Definition 1). As described in Section 3.1, a prediction model M provides us with an estimated score sk(â) for the new solver â.

#### Algorithm 2: Evaluation Framework

Input: Solvers A, Instances I, Runtimes r : A×I → [0, τ]
Output: Average Ranking Accuracy Ōacc, Average Fraction of Runtime Ōrt

1  O ← ∅
2  for â ∈ A do
3    A′ ← A \ {â}
4    (sâ, R) ← runALAlgorithm(A′, I, r, â)  // Refer to Algorithm 1
     // Determine Ranking Accuracy
5    Oacc ← 0
6    for a ∈ A′ do
7      if (sk(a) − sâ) · (par2(a) − par2(â)) > 0 then
8        Oacc ← Oacc + 1/|A′|
     // Determine Runtime Fraction
9    r′ ← Σe∈I r(â, e)
10   Ort ← 0
11   for e ∈ I do
12     if ∃t : (e, t) ∈ R then
13       Ort ← Ort + t/r′
14   O ← O ∪ {(Oacc, Ort)}
15 Ōacc, Ōrt ← average(O)
16 return Ōacc, Ōrt

To evaluate a concrete instantiation of Algorithm 1, i.e., a concrete choice for all the sub-routines, we perform cross-validation on our set of solvers, as shown in Algorithm 2. That means each solver plays the role of the new solver â once (Line 2). Note that the new solver in each iteration is excluded from the set of solvers A to avoid data leakage (Line 3). After running our active-learning framework for solver â (Line 4), we compute the value of both our optimization goals, i.e., ranking accuracy and runtime. We define the ranking accuracy Oacc ∈ [0, 1] (higher is better) as the fraction of pairs (â, a) for all a ∈ A that are decided correctly regarding the ground-truth scoring par2 (Lines 5-8). The fraction of runtime that the algorithm needs to arrive at its conclusion is denoted by Ort ∈ [0, 1] (lower is better). This metric puts the runtime summed over the sampled instances in relation to the runtime summed over all instances in the dataset (Lines 9-13). Finally, we compute averages of the output metrics in Line 15 after collecting all cross-validation results in Line 14. Overall, we want to find an approach that maximizes

$$O\_{\delta} := \delta O\_{\text{acc}} + (1 - \delta) \left(1 - O\_{\text{rt}}\right) \quad , \tag{2}$$

whereby δ ∈ [0, 1] allows for linear weighting between the two optimization goals Oacc and Ort. Plotting the approaches that maximize O<sup>δ</sup> for all δ ∈ [0, 1] on an Ort-Oacc-diagram provides us with a Pareto front of the best approaches for different optimization-goal weightings.
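The trade-off objective Oδ and the resulting selection of Pareto-optimal configurations can be sketched as follows (the (Ort, Oacc) values of the three configurations are invented for illustration):

```python
# Sketch of the combined objective O_delta = delta*O_acc + (1-delta)*(1-O_rt)
# and of sweeping delta over [0, 1] to find the best configurations.
def o_delta(o_acc, o_rt, delta):
    return delta * o_acc + (1 - delta) * (1 - o_rt)

# Hypothetical evaluation results per configuration: (O_rt, O_acc).
configs = {"a": (0.10, 0.92), "b": (0.30, 0.95), "c": (0.35, 0.90)}

# Config "c" is dominated: it costs more runtime than "b" yet is less
# accurate, so no weighting delta ever makes it the maximizer.
best = {d / 10: max(configs, key=lambda c: o_delta(configs[c][1], configs[c][0], d / 10))
        for d in range(11)}
assert "c" not in best.values()
```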

#### 4.2 Data

In our experiments, we work with the dataset of the SAT Competition 2022 Anniversary Track [2]. The dataset consists of 5355 instances with respective runtime data of 28 sequential SAT solvers. We also use a database of 56 instance features<sup>3</sup> from the Global Benchmark Database (GBD) by Iser et al. [17]. They comprise instance size features and node distribution statistics for several graph representations of SAT instances, among others, and are primarily inspired by the SATzilla 2012 features described in [38]. All features are numeric and free of missing values. We drop 10 out of 56 features because of zero variance. Overall, prediction models have access to 46 instance features and 27 runtime features, i.e., excluding the current new solver aˆ.

Additionally, we retrieve instance-family information<sup>4</sup> to evaluate the composition of our sampled benchmarks. Instance families comprise instances from the same application domain, e.g., planning, cryptography, etc., and are a valuable tool for analyzing solver performance.

For hyper-parameter tuning, we randomly sample 10 % of the complete set of 5355 instances with stratification regarding the instances' family. All instance families that are too small, i.e., 10 % of them corresponds to less than one instance, are put into one meta-family for stratification. This tuning dataset allows for a more extensive exploration of the hyper-parameter space.

#### 4.3 Hyper-parameters

Given Algorithm 1, there are several possible instantiations for the three subroutines, i.e., ranking, selection, and stopping. Also, there are different choices for the runtime-label prediction model and runtime discretization. We describe these experimental configurations in the following.

Ranking. Regarding ranking (cf. Section 3.1), we experiment with the following approaches and hyper-parameter values:

	- History size: Consider the latest 1, 10, 20, 30, or 40 predictions within a voting approach for stability. The latest x predictions for each instance vote on the instance's winning label.
	- Fallback threshold: If the difference of scores between the new solver â and another solver drops below 0.01, 0.05, or 0.1, use the partially observed PAR-2 ranking as a tie-breaker.

<sup>3</sup> https://benchmark-database.de/getdatabase/base\_db

<sup>4</sup> https://benchmark-database.de/getdatabase/meta\_db

Selection. For selection (cf. Section 3.2), we experiment with the following methods and hyper-parameter values. Since the potential runtime of experiments is orders of magnitude larger than the model's update time, we only consider incrementing our benchmark by one instance at a time rather than using batches, which is also proposed in recent active-learning work [31,34]. A drawback of this is the lack of parallel execution of runtime experiments.

	- Uncertainty sampling, fallback threshold: Use random sampling for the first 0 %, 5 %, 10 %, 15 %, or 20 % of instances to explore the instance space.
	- Uncertainty sampling, runtime scaling: Whether to normalize uncertainty scores per instance by the average runtime of solvers on it or use the absolute values.
	- Information-gain sampling, fallback threshold: Use random sampling for the first 0 %, 5 %, 10 %, 15 %, or 20 % of instances to explore the instance space.
	- Information-gain sampling, runtime scaling: Whether to normalize information-gain scores per instance by the average runtime of solvers on it or use the absolute values.

Stopping. For stopping decisions (cf. Section 3.3), we experiment with the following criteria and hyper-parameter values:

	- Ranking criterion, minimum amount: Sample at least 2 %, 8 %, 10 %, or 12 % of instances before applying the criterion.
	- Ranking criterion, convergence duration: Stop if the predicted ranking stays the same for a number of sampled instances equal to 1 % or 2 % of all instances.
	- Wilcoxon criterion, minimum amount: Sample at least 2 %, 8 %, 10 %, or 12 % of instances before applying the criterion.
	- Wilcoxon criterion, p-value threshold: Stop once the average of p-values drops below 5 %.
	- Wilcoxon criterion, exponential moving average: Incorporate previous significance values by using an EMA with β = 0.1 or β = 0.7.

Prediction model. Our experiments only use one model configuration for runtime-label prediction since an exhaustive grid search would be infeasible. In preliminary experiments, we compared various model types from scikit-learn [28]. In particular, we conducted nested cross-validation, including hyper-parameter tuning, and used Matthews Correlation Coefficient [12,22] to assess the performance for predicting runtime labels. Our final choice is a stacking ensemble [36] of two prediction models, a quadratic-discriminant analysis [33] and a random forest [3]. Both these models can learn non-linear relationships between the instance features and the runtime labels. Stacking means that another prediction model, in our case a simple decision tree, decides which of the two ensemble members makes the prediction on which instance.

Runtime discretization. To define prediction targets, i.e., discrete runtime labels, we use hierarchical clustering with k = 3 and a log-single-link criterion, which produced the most useful labels in preliminary experiments. We denote this adapted solver scoring function with s3. In our chosen hierarchical procedure, each non-timeout runtime starts in a separate interval. We then gradually merge intervals whose single-link logarithmic distance is the smallest until the desired number of partitions is reached. Other clustering approaches that we tried include hierarchical clustering with mean-, median-, and complete-link criterion, as well as k-means and spectral clustering.
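The log-single-link merging procedure can be sketched as follows (a simplified reconstruction from the description above, not our actual implementation; the runtimes are hypothetical):

```python
# Sketch of log-single-link hierarchical discretization: each non-timeout
# runtime starts as its own interval, and adjacent intervals with the
# smallest gap in log-space are merged until k-1 intervals remain
# (timeouts separately form the k-th cluster).
import math

def log_single_link(runtimes, k=3):
    """Partition sorted non-timeout runtimes into k-1 intervals."""
    intervals = [[t] for t in sorted(runtimes)]
    while len(intervals) > k - 1:
        # single-link distance between adjacent intervals in log-space
        gaps = [math.log(intervals[i + 1][0]) - math.log(intervals[i][-1])
                for i in range(len(intervals) - 1)]
        i = gaps.index(min(gaps))
        intervals[i:i + 2] = [intervals[i] + intervals[i + 1]]
    return intervals

# Two fast solvers, two slow ones: the large log-scale gap splits them.
assert log_single_link([0.1, 0.2, 300.0, 400.0], k=3) == [[0.1, 0.2], [300.0, 400.0]]
```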

To obtain useful labels, we need to ensure that discretized labels still discriminate solvers and align with the actual PAR-2 ranking. We analyzed the ranking induced by s3 in preliminary experiments with the SAT Competition 2022 Anniversary Track [2]. According to a Wilcoxon signed-rank test with α = 0.05, 87.83 % of solver pairs have significantly different scores after discretization, only a slight drop compared to 89.95 % before discretization. Further, our ranking approach correctly decides for almost all (about 97.45 %; σ = 3.68 %) solver pairs which solver is faster. In particular, the Spearman correlation of the s3 and PAR-2 rankings is about 0.988, which is very close to the optimal value of 1 [6]. All these results show that discretized runtimes are suitable for our framework.

#### 4.4 Implementation Details

For reproducibility, our source code and data are available on GitHub (cf. footnotes in Section 1). Our code is implemented in Python using scikit-learn [28] for making predictions and gbd-tools [17] for SAT-instance retrieval.

## 5 Evaluation

In this section, we evaluate our active-learning framework. First, we analyze and tune the different sub-routines of our framework on the tuning dataset. Next, we evaluate the best configurations with the full dataset. Finally, we analyze the importance of different instance families to our framework.

#### 5.1 Hyper-Parameter Analysis

Our experiments follow the evaluation framework introduced in Section 4.1. Fig. 2 shows the performance of the approaches from Section 4.3 on Ort-Oacc diagrams for the hyper-parameter-tuning dataset. Evaluating a particular configuration with Algorithm 2 returns a point (Ort, Oacc). We do not show intermediate results of the active-learning procedure but only the final results after stopping. The plotted lines represent the best-performing configurations per ranking approach (Fig. 2a), selection approach (Fig. 2b), and stopping criterion (Fig. 2c). In particular, we show the Pareto front, i.e., of all configurations that share a particular value of the plotted hyper-parameter, we take the maximum ranking accuracy over all remaining hyper-parameters not displayed in the corresponding plot.
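Extracting such a Pareto front from a list of (runtime fraction, ranking accuracy) points is straightforward. A small illustrative helper (not the actual evaluation code):

```python
def pareto_front(points):
    """Return the (runtime, accuracy) points not dominated by any other
    point, i.e., no other point is at least as cheap and strictly more
    accurate. Sorted by runtime ascending; ties keep the best accuracy."""
    front = []
    for rt, acc in sorted(points, key=lambda p: (p[0], -p[1])):
        if not front or acc > front[-1][1]:
            front.append((rt, acc))
    return front

# Hypothetical configurations: (runtime fraction, ranking accuracy).
configs = [(0.05, 0.88), (0.10, 0.92), (0.07, 0.85), (0.10, 0.90)]
front = pareto_front(configs)
```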

Fig. 2: Ort-Oacc diagrams comparing different hyper-parameter instantiations of our active-learning framework on the hyper-parameter-tuning dataset. The x-axis shows the ratio of total solver runtime on the sampled instances relative to all instances. The y-axis shows the ranking accuracy (cf. Section 4.1). Each line entails the front of Pareto-optimal configurations for the respective hyper-parameter instantiation.

Fig. 3: Scatter plot comparing different instantiations of trade-off parameter δ for our active-learning framework on the hyper-parameter-tuning dataset. The x-axis shows the fraction of runtime Ort of the sample, while the y-axes show the fraction of instances sampled and ranking accuracy, respectively. The color indicates the weighting between different optimization goals δ ∈ [0, 1]. The larger δ, the more we favor accuracy over runtime.

Regarding ranking approaches (Fig. 2a), using the predicted s3-induced runtime-label ranking consistently outperforms the partially observed PAR-2 ranking for each possible value of the trade-off parameter δ. This outcome is expected since selection decisions are not random. For example, we might sample more instances of one family if it benefits discrimination of solvers. While the partially observed PAR-2 score is skewed, the prediction model can account for this.

Regarding the selection approaches (Fig. 2b), uncertainty sampling performs best in most cases. However, information-gain sampling is beneficial if runtime is strongly favored (small δ; runtime fraction less than 5 %). This result aligns with our expectations: Information-gain sampling selects instances that maximize the expected reduction in entropy. This means we sample instances revealing similarities between solvers rather than differences, which helps to build a confident model quickly. However, the method cannot select helpful instances for distinguishing solvers later. Random sampling performs reasonably well but is outperformed by uncertainty sampling in all cases, showing the benefit of actively selecting instances based on a prediction model.
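The core of uncertainty sampling can be sketched in a few lines: given the model's predicted label distribution for each unlabeled instance, select the instance with maximum entropy. This is a generic variant; the exact scoring in our framework may differ:

```python
import math

def uncertainty_sample(distributions):
    """Index of the candidate instance whose predicted label
    distribution has maximum Shannon entropy."""
    def entropy(p):
        return -sum(q * math.log(q) for q in p if q > 0)
    return max(range(len(distributions)),
               key=lambda i: entropy(distributions[i]))

# Hypothetical label distributions for three candidate instances;
# the second is the most uncertain and gets selected.
pick = uncertainty_sample([[0.9, 0.05, 0.05],
                           [0.4, 0.3, 0.3],
                           [0.6, 0.3, 0.1]])
```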

Regarding the stopping criteria (Fig. 2c), the ranking stopping criterion performs most consistently well. If accuracy is strongly favored (very high δ), the Wilcoxon stopping criterion performs better. The subset-size stopping criterion performs reasonably well but does not improve beyond a certain accuracy because of sampling a fixed subset of instances.

Table 2: Performance comparison (on the full dataset) of the best-performing active-learning approaches (AL), random sampling of the same runtime fraction with 1000 repetitions (Random), and statically selecting the instances most frequently sampled by active-learning approaches (Most Freq.)

(a) Best-performing AL approach for δ ∈ [0.2, 0.7]

(b) Best-performing AL approach for δ ∈ (0.7, 0.8]

Fig. 3a shows an interesting consequence of weighting our optimization goals: If we, on the one hand, desire to get a rough estimate of a solver's performance fast (low δ), approaches favor selecting many easy instances. In particular, the fraction of sampled instances is larger than the fraction of runtime. By having many observations, it is easier to build a model. If we, on the other hand, desire to get a good estimate of a solver's performance in a moderate amount of time (high δ), approaches favor selecting few, difficult instances. In particular, the fraction of instances is smaller than the fraction of runtime.

Furthermore, Fig. 3b reveals which values of δ make the most sense. The range δ ∈ [0.2, 0.8] corresponds to the points with a runtime fraction between 0.03 and 0.22. We consider this region to be most promising, analogous to the elbow method in cluster analysis [18].

#### 5.2 Full-Dataset Evaluation

Having selected the most promising hyper-parameters, we run our active-learning experiments on the complete Anniversary Track dataset (5355 instances). The aforementioned range δ ∈ [0.2, 0.8] only results in two distinct configurations. The best-performing approach for δ ∈ [0.2, 0.7] uses the predicted runtime-label ranking, information-gain sampling, and ranking stopping criterion. It can predict a new solver's PAR-2 ranking with 90.48 % accuracy (Oacc) in only 5.41 % of the full evaluation time (Ort). The best-performing approach for δ ∈ (0.7, 0.8] uses the predicted runtime-label ranking, uncertainty sampling, and ranking stopping criterion. It can predict a new solver's PAR-2 ranking with 92.33 % accuracy (Oacc) in only 10.35 % of the full evaluation time (Ort).

Table 2 shows how both active-learning approaches (column AL) compare against two static baselines: Random samples instances until it reaches roughly the same fraction of runtime as the AL benchmark sets. We repeat sampling 1000 times and report average results. Most Freq. uses a static benchmark set consisting of those instances most frequently sampled by our active-learning approach. In particular, we consider the average sampling frequency over all solvers and Pareto-optimal active-learning approaches.

Fig. 4: Scatter plot showing the importance of different instance families to our framework on the full dataset. The x-axis shows the frequency of instance families in the dataset. The y-axis shows the average frequency of instance families in the samples selected by active learning. The dashed line represents families that occur with the same frequency in the dataset and samples.

Both our AL approaches perform better than random sampling. However, the performance differences are not significant regarding a Wilcoxon signed-rank test with α = 0.05 and also depend on the fraction of sampled runtime (cf. Fig. 2b). A clear advantage of our approach is, though, that it indicates when to stop adding further instances, depending on the trade-off parameter δ. While the active-learning results are less strong on the full dataset than on the smaller tuning dataset, they still show the benefit of making benchmark selection dependent on the solvers to distinguish.

A static benchmark using the most frequently AL-sampled instances performs poorly, though, compared to active learning and random sampling. This outcome is somewhat expected since the static benchmark does not reflect the right balance of instance families: Families whose instances are uniform-randomly selected by AL, e.g., for different solvers, appear less often in this benchmark than families where some instances are sampled more often than others.

#### 5.3 Instance-Family Importance

Selection decisions of our approach also reveal the importance of different instance families to our framework. Fig. 4 shows the occurrence of instance families within the dataset and the benchmarks created by active learning. We use the best-performing configurations for all δ ∈ [0, 1] and examine the selection decisions by the active-learning approach on the SAT Competition 2022 Anniversary Track dataset [2]. While most families appear with the same fraction in the dataset and the sampled benchmarks, a few outliers need further discussion. Problem instances of the families fpga, quasigroup-completion, and planning are especially helpful to our framework in distinguishing solvers. Instances of these families are selected over-proportionally in comparison to the full dataset. In contrast, instances of the largest family, i.e., hardware-verification, roughly appear with the same fraction in the dataset and the sampled benchmarks. Finally, instances of the family cryptography are less important in distinguishing solvers than their vast weight in the dataset suggests. A possible explanation is that these instances are very similar, such that a small fraction of them is sufficient to estimate a solver's performance on all of them.

## 6 Conclusions and Future Work

In this work, we have addressed the New-Solver Problem: Given a new solver, we want to find its ranking amidst competitors. Our approach provides accurate ranking predictions while needing significantly less runtime than a complete evaluation on a given benchmark set. On data from the SAT Competition 2022 Anniversary Track, we can determine a new solver's PAR-2 ranking with about 92 % accuracy while only needing 10 % of the full-evaluation time. We have evaluated several ranking algorithms, instance-selection approaches, and stopping criteria within our sequential active-learning framework. We also took a brief look at which instance families are the most prevalent in selection decisions.

Future work may compare further sub-routines for ranking, instance selection, and stopping. Additionally, one can apply our evaluation framework to arbitrary computation-intensive problems, e.g., other NP-complete problems than SAT, as all discussed active-learning methods are problem-agnostic. Such problems share most of the relevant properties of SAT solving, i.e., there are established instance features, a complete benchmark is expensive, and traditional benchmark selection requires expert knowledge.

From the technical perspective, one could formulate runtime discretization as an optimization problem rather than addressing it empirically. Further, a major shortcoming of our current approach is the lack of parallelization, selecting instances one at a time. Benchmarking on a computing cluster with n cores benefits from having batches of n instances. However, bigger batch sizes n impede active learning. Also, it is unclear how to synchronize instance selection and updates of the prediction model without wasting too much runtime.

Acknowledgments. This work was supported by the Ministry of Science, Research and the Arts Baden-Württemberg, project Algorithm Engineering for the Scalability Challenge (AESC).

## References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

### ParaQooba: A Fast and Flexible Framework for Parallel and Distributed QBF Solving*

Maximilian Heisinger<sup>1</sup>, Martina Seidl<sup>1</sup>, and Armin Biere<sup>2</sup>

<sup>1</sup> JKU Linz, Linz, Austria, {maximilian.heisinger,martina.seidl}@jku.at
<sup>2</sup> ALU Freiburg, Freiburg, Germany, biere@informatik.uni-freiburg.de

Abstract. Over the last years, innovative parallel and distributed SAT solving techniques were presented that could impressively exploit the power of modern hardware and cloud systems. Two approaches were particularly successful: (1) search-space splitting in a Divide-and-Conquer (D&C) manner and (2) portfolio-based solving. The latter executes different solvers or configurations of solvers in parallel. For quantified Boolean formulas (QBFs), the extension of propositional logic with quantifiers, there is surprisingly little recent work in this direction compared to SAT. In this paper, we present ParaQooba, a novel framework for parallel and distributed QBF solving which combines D&C parallelization and distribution with portfolio-based solving. Our framework is designed in such a way that it can be easily extended and arbitrary sequential QBF solvers can be integrated out of the box, without any programming effort. We show how ParaQooba orchestrates the collaboration of different solvers for joint problem solving by performing an extensive evaluation on benchmarks from QBFEval'22, the most recent QBF competition.

## 1 Introduction

*Quantified Boolean formulas* (QBFs) extend propositional logic by quantifiers over the Boolean variables [2]. As a consequence, the decision problem of QBF (QSAT) is PSPACE-complete, which is potentially harder than the NP-complete decision problem of propositional logic (SAT). Hence, the quantifiers allow for an efficient encoding of many reasoning problems from formal verification, synthesis, and planning [26] that most likely do not have a compact formulation in propositional logic. Over the last decade, considerable progress has been made in sequential QBF solving [22,21]. In contrast to SAT, where conflict-driven clause learning (CDCL) [19] is the predominant solving paradigm, in QBF solving different approaches of orthogonal strength have been presented. Besides QCDCL, the QBF variant of CDCL, which is implemented for example in the solver DepQBF [17], clausal abstraction as implemented in the solver Caqe [23] and abstraction-refinement based expansion as implemented in the solver RaReQs [13] are particularly successful [22,21]. All of these QBF solving approaches considerably benefit from preprocessing, i.e., an extra step before the actual solving in which certain redundancies of a formula are eliminated in a satisfiability-preserving way with the aim of making it easier for the solver [10].

* Supported by the LIT AI Lab funded by the State of Upper Austria.

S. Sankaranarayanan and N. Sharygina (Eds.): TACAS 2023, LNCS 13993, pp. 426–447, 2023. https://doi.org/10.1007/978-3-031-30823-9_22

Despite the vivid development in sequential QBF solving, only few approaches have been presented for parallel and distributed QBF solving [18]. The most recent parallel QBF solvers are HordeQBF [1], which integrates sequential QCDCL-based solvers to obtain a parallel QBF solver, and, more recently, a basic implementation of a QBF module based on the parallel SAT solver ParaCooba [6] with DepQBF as its only backend solver. To the best of our knowledge, besides these two approaches no other parallel QBF solver has recently been presented. The situation in SAT is different: several very powerful parallel and distributed SAT solvers like Mallob [24], Painless [5], and the aforementioned solver ParaCooba [7] have been released. They impressively show the potential of parallel and distributed approaches by solving hard SAT instances, for example from multiplier verification [15].

In this paper, we present ParaQooba, a novel framework for parallel and distributed QBF solving that integrates search-space splitting based on the Divide-and-Conquer paradigm with portfolio solving. Our framework is built on top of the ParaCooba SAT solving framework and extends its basic non-portfolio QBF solving module. ParaQooba reuses most of ParaCooba's modules providing management and distribution of solver tasks. In addition, we implemented a very generic interface that allows the easy integration of any QBF solver binary into our framework.

Our main contributions are as follows:


ParaQooba is integrated into the ParaCooba repository and available on GitHub:

#### https://github.com/maximaximal/paracooba

This paper is structured as follows: First, we introduce some preliminaries required for the rest of the paper in the following section. We continue with related work in section 3. After that, section 4 summarizes concepts of the ParaCooba solver framework used in our work. Then we introduce how we apply Divide-and-Conquer to solving QBF in section 5. Having introduced the background, we present our portfolio ParaQooba module in detail in section 6 and provide an extensive evaluation in section 7. Finally, we summarize our findings and conclude in section 8.

## 2 Preliminaries

We consider QBFs Q.ϕ in *prenex conjunctive normal form* (PCNF) where the *prefix* Q is of the form Q1x1 ... Qnxn with Qi ∈ {∀, ∃}. The *matrix* ϕ is a propositional formula over the variables x1,...,xn in conjunctive normal form (CNF). A formula in CNF is a conjunction (∧) of clauses. A *clause* is a disjunction (∨) of literals. A literal is a variable x, a negated variable ¬x, or a (possibly negated) truth constant ⊤ (true) or ⊥ (false). For a literal l, the expression ¯l denotes x if l = ¬x and it denotes ¬x otherwise. We sometimes write a clause as a set of literals and a CNF formula as a set of clauses. Further, it is often convenient to partition the quantifier prefix into *quantifier blocks*, i.e., maximal sets of consecutive variables with the same quantifier type. For example, for the QBF ∀x1∀x2∃y1∃y2.ϕ we also write ∀X∃Y.ϕ with X = {x1, x2} and Y = {y1, y2}. With upper-case letters X, Y, . . . (possibly subscripted), we usually denote sets of variables, while with lower-case letters x, y, . . . (also possibly subscripted), we denote variables. If ϕ is a CNF formula, then ϕx←t is the CNF formula obtained from ϕ by replacing all occurrences of variable x by the truth constant t ∈ {⊤, ⊥}. Depending on the value of t, variable x is either set to true (if t is ⊤) or to false (if t is ⊥). We define the semantics of QBFs as follows:
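Following the standard definition of QBF semantics, restated here using the substitution notation ϕx←t from above:

```latex
% Standard recursive semantics of a closed PCNF QBF.
\begin{itemize}
  \item The QBF $\forall x\, Q.\varphi$ is true iff both
        $Q.\varphi_{x \leftarrow \top}$ and $Q.\varphi_{x \leftarrow \bot}$ are true.
  \item The QBF $\exists x\, Q.\varphi$ is true iff
        $Q.\varphi_{x \leftarrow \top}$ or $Q.\varphi_{x \leftarrow \bot}$ is true.
  \item A QBF with empty prefix is true iff its matrix $\varphi$
        evaluates to $\top$.
\end{itemize}
```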


Note that we assume that all variables of a QBF are quantified, i.e., we are considering closed formulas only. Further, we use the standard semantics of conjunction, disjunction, negation, and truth constants. For example, the QBF φ1 = ∀x∃y.((x ∨ y) ∧ (¬x ∨ ¬y)) is true, while φ2 = ∃y∀x.((x ∨ y) ∧ (¬x ∨ ¬y)) is false. As this small example already shows, the semantics impose an ordering on the variables w.r.t. the prefix. Given a QBF Q.ϕ, we say that x <Q y iff x occurs before y in the prefix. If clear from the context, we write x < y. In φ1, we have x < y, while in φ2, we have y < x.
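This recursive semantics translates directly into a naive evaluator. The following self-contained sketch (integer literals with −x encoding ¬x; an exponential procedure, for illustration only) confirms that φ1 is true and φ2 is false:

```python
def substitute(cnf, var, value):
    """Set `var` to True/False: drop satisfied clauses and remove
    falsified literals (literals are ints, -x encodes a negation)."""
    sat = var if value else -var
    return [[l for l in c if l != -sat] for c in cnf if sat not in c]

def evaluate(prefix, cnf):
    """Evaluate a closed PCNF QBF; prefix is a list of
    ('A'|'E', variable) pairs in quantification order."""
    if [] in cnf:                 # an empty clause is unsatisfiable
        return False
    if not prefix:                # closed formula: all clauses resolved
        return not cnf
    (q, x), rest = prefix[0], prefix[1:]
    branches = [evaluate(rest, substitute(cnf, x, v)) for v in (True, False)]
    return all(branches) if q == 'A' else any(branches)

matrix = [[1, 2], [-1, -2]]       # (x ∨ y) ∧ (¬x ∨ ¬y) with x = 1, y = 2
phi1 = evaluate([('A', 1), ('E', 2)], matrix)   # ∀x∃y
phi2 = evaluate([('E', 2), ('A', 1)], matrix)   # ∃y∀x
```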

## 3 Related Work

In practical QBF solving, attempts to parallelize and distribute QBF solvers have a long history (cf. [18] for a survey). Already more than 20 years back, the first distributed QBF solver PQSolve [4] was presented, in a time when QCDCL had not been invented yet. With the advent of QCDCL, several attempts have been made to build parallel QCDCL solvers and implement knowledge-sharing mechanisms for learned clauses and cubes. One example of such a solver is PAQuBE [16]. Unfortunately, the code of most of the early approaches is not available anymore. Following the success of Cube-and-Conquer-based search-space splitting, the QBF solver MPIDepQBF has been presented [14]. While MPIDepQBF does not implement any sophisticated look-ahead mechanisms, it could demonstrate that even without knowledge sharing considerable speedup could be achieved. These results serve as motivation for the approach presented in this paper. Unfortunately, MPIDepQBF is implemented in an older version of OCaml that does not run on recent systems and relies on now deprecated libraries, making a comparison impossible. As indicated by its name, it is tailored around the sequential QBF solver DepQBF [17]. Another recent MPI-based QBF solver is HordeQBF [1], which implements knowledge sharing for QCDCL solvers. It is designed in such a way that it allows the integration of any QCDCL solver. In order to integrate a solver, it requires that the solver implements a certain interface, i.e., programming effort is necessary to add a new solver. To the best of our knowledge, it includes the QBF solver DepQBF only. HordeQBF does not perform search-space splitting, but it is a parallel portfolio solver with clause and cube sharing. It diversifies the parallel solver instances by different parameter settings. This differs from sequential portfolio solvers as presented in [12], which select among different solvers based on some properties of the input formula.
Overall, a very strong focus on QCDCL-based solvers can be observed for parallel QBF solving frameworks. Because of this, many chances for better solving performance are missed, as nowadays there are many other solvers of orthogonal strength. With ParaQooba we provide a simple way of exploiting the power of the different solving approaches without any integration effort.

## 4 ParaCooba

Our novel framework ParaQooba (with *q* in the middle of its name) builds on top of the SAT solver ParaCooba (with *c* in the middle of its name). In this section, we describe the parts of ParaCooba that are relevant for the remainder of this work, i.e., for our extension of ParaCooba to ParaQooba.

ParaQooba will be made available publicly during the artifact evaluation under the MIT license, similar to ParaCooba [7,6], which is publicly available on GitHub also under the MIT license<sup>3</sup>. ParaCooba is a distributed Cube-and-Conquer (C&C) solver that implements a proprietary peer-to-peer based load-balancing protocol. In contrast to standard D&C solvers, the splitting of the search space can be done either upfront, by using a look-ahead solver that produces n cubes, or online during solving by look-ahead or other heuristics. Amongst other information, the cubes are stored in a binary tree, the *solve tree*.

*Solver module.* A *solver module* manages the sequential solver that is responsible for solving a subproblem. Different solver modules have different code bases, but they generally share common concepts. A solver module implements a parser task, which is created directly after the module has been initiated and serves as its starting point. It parses the input formula in its own worker thread and instantiates a solver manager based on the fully parsed formula. The parser task also creates the first solver task as the root of the solve tree.

<sup>3</sup> github.com/maximaximal/Paracooba

*Solver Tasks.* For ParaCooba, *solver tasks* are paths in the solve tree, with a *parser task* being used to generate the tree's root. Solver tasks are usually started as children of other tasks, saving references to their parents, with the root solver task being the only exception. A task's depth in the solve tree represents its priority to be worked on: the greater the depth, the more important a task is to be solved locally and the less important it is to be offloaded to other compute nodes by the broker module. Only tasks that were created locally may be distributed.

*Broker module.* The *broker module* handles relations between solver tasks and processes their results. While the solver module generates tasks, the broker schedules them based on their priorities (their depths) and offloads them if a different compute node has less load than the current node. A task result is propagated upwards across compute nodes; there is no conceptual difference between locally and remotely solved tasks. The broker module is generic and does not rely on a specific solver module, instead providing the environment a solver module works in. It is already provided by ParaCooba and stays the same for different solver modules.
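The depth-based priority rule can be pictured with a toy scheduler: deeper tasks come out first locally, while the shallowest queued task is the preferred offloading candidate. This is a deliberate simplification; the real broker also tracks remote load and task state:

```python
import heapq

class ToyBroker:
    """Sketch of depth-priority scheduling, not ParaCooba's broker."""
    def __init__(self):
        self._heap = []
        self._count = 0                      # tie-breaker, keeps FIFO order
    def submit(self, depth, task):
        # Negate depth so that deeper tasks pop first.
        heapq.heappush(self._heap, (-depth, self._count, task))
        self._count += 1
    def next_local(self):
        """Deepest task: most important to solve locally."""
        return heapq.heappop(self._heap)[2]
    def offload_candidate(self):
        """Shallowest task: least costly to ship to another node."""
        return max(self._heap)[2] if self._heap else None

broker = ToyBroker()
broker.submit(1, "root-split")
broker.submit(3, "deep-leaf")
broker.submit(2, "mid-task")
```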

*Cube Sources.* For generating concrete subproblems, *cube sources* provide assumption literals to leaf solver tasks. A cube source decides whether a given solver task should split again, based on the current configuration (mainly the splitting depth) and the given formula. Every solver module can implement its own cube source, hence there are different kinds of cube sources for different solver modules. On this basis, very flexible mechanisms for the selection of splitting variables can be implemented, ranging from a simple count of literal occurrences to advanced look-ahead heuristics.

*Task Tree.* The *task tree* is built lazily, i.e., only once a leaf is visited, it is either expanded into a sub-tree or solved. We picture such a tree in Figure 1. This tree has a *depth* of 1, because the path from the tree's root solver task to the leaf solver tasks has a length of 1. Once the active cube source stops further splits from being carried out, the tree's maximum depth is reached. The worker thread currently executing a task then borrows a solver instance from the solver manager's central store. Each solver instance is created on-the-fly once (normally initialized based on the parser task) for each worker thread, which can also happen for multiple worker threads in parallel. After a solver instance has been created, all other tasks solved by the same worker thread use the same solver instance.

*Guiding Paths.* The cubes that are given to solver instances as assumptions are called *guiding paths*. They are generated from the path to the leaf being solved. The solver instance then handles the solving internally, blocking the worker thread until either a result is generated or the task is terminated. Results are not returned to parents but are instead handled by the broker module, which then traverses the solve tree upwards as far as possible, based on the results already in the tree. Different kinds of evaluations can be defined on every level using a user-defined *assessment function*. With the result processed by the broker module, the solver task then finishes and the worker thread can take on the next task, based on the next-highest priority. The broker may delete the solver task after it has finished processing, if the result was already used somewhere above it in the tree and no information from the original solver task structure is required anymore. Once the broker module has enough information to solve the root task, the result of the formula has been computed successfully.

*Solver Handle.* A *solver handle* wraps instances of a given solver. It must be able to receive an *Assume* event, directly followed by a *Solve* event. While processing these events, a correctly working handle must block its calling thread until a result is found. Additionally, it must be fully re-entrant after finishing processing, so that the next solver task can apply new assumptions. On top of this, a handle must also be able to process a *Terminate* event, stopping the solver and early-returning control to its calling thread. Such a termination event may happen at any time, as it is generated by other solver tasks. This possibility of random terminations was an issue for our extension to ParaQooba, as it complicated synchronization of all involved threads.

*QBF Solver Module.* ParaCooba already provided a basic *QBF solver module* similar to the approach seen in MPIDepQBF. It implemented a QDIMACS parser in a new solver module based on the SAT module. It realizes a simple cube source that returns the variable at the nth position in the prefix, with n being the current depth of a solver task. The solve tree is built using two adapted assessment functions: one for universally quantified variables (requiring all sub-trees to be true) and one for existentially quantified variables (requiring at least one sub-tree to be true). The assessment functions also use ParaCooba's cancellation support to terminate unneeded siblings after results already satisfy the respective subproblem. As backend solver, it exclusively uses DepQBF, which provides an incremental API (which, to the best of our knowledge, no other recent solver provides).
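This simple prefix-order cube source can be sketched as follows (a hypothetical re-formulation with literals as signed integers, not the module's C++ code):

```python
def prefix_cube(prefix, path):
    """Assumptions (the cube) for a solver task: the i-th branch
    decision on `path` assigns the i-th prefix variable.
    prefix: list of ('A'|'E', var) pairs; path: list of bools
    describing the branches taken from the root."""
    return [x if bit else -x
            for (_, x), bit in zip(prefix, path)]

# Prefix ∀x1 ∃x2 ∀x3; a task at depth 2 taking branches ⊤ then ⊥:
cube = prefix_cube([('A', 1), ('E', 2), ('A', 3)], [True, False])
```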

*Summary.* With its already existing tree-based QBF solving module together with its support for distributed solving, ParaCooba provides a stable basis for building an advanced parallel QBF solver. While the existing QBF module is rather uncompetitive with a few exceptions that indicate its potential, its core infrastructure turned out to be very useful to build our novel framework ParaQooba that offers built-in portfolio support.

The networking support mentioned above enables combining multiple compute nodes by giving each peer a connection to the main node. This is achieved by setting the --known-remote option. With this feature, it becomes possible to easily distribute larger problem instances on a cluster or in the cloud.

## 5 Architecture of ParaQooba: Combining Divide-and-Conquer and Portfolio Solving

Our framework ParaQooba combines Divide-and-Conquer (D&C) search space splitting with portfolio solving. The key feature of ParaQooba compared to ParaCooba is to allow portfolio solving at different search depths. The idea is illustrated in Figure 1. Both approaches are widely used to realize parallel and distributed SAT and QBF solvers. The D&C approach has been especially successful for hard combinatorial SAT problems [11] in a variant called Cube-and-Conquer (C&C). The C&C approach relies on powerful, but expensive lookahead solvers that heuristically decide which variables shall be considered for splitting. In its original SAT version, ParaCooba builds upon this idea [7].

For a QBF Q1X Q2Y Q.ϕ with Q1 ≠ Q2 and Q1, Q2 ∈ {∀, ∃} though, the possible choices for variable selection are more restricted because of the quantifier prefix. In general, only variables from the outermost quantifier block Q1X may be considered, because otherwise, the value of the formula might change. Jordan et al. [14] observed that for QBF, following the sequential order of the variables in the first quantifier block already leads to improvements compared to the sequential implementation of DepQBF. The already existing QBF solver module of ParaCooba (see section 4) relied on this observation: it traverses the prefix of a PCNF and splits each visited leaf into two sub-trees, respecting both universal and existential quantifiers, until a pre-defined maximum depth is reached. Hence, it re-implements the approach of MPIDepQBF in ParaCooba.

Our framework ParaQooba generalizes the previous QBF module of ParaCooba in two ways. First, the interface is generalized such that any QBF solver can easily be integrated as a backend solver, without programming effort. Second, it is now also possible to run several solvers in the leaves, as shown in Figure 2 for one split. Overall, ParaQooba realizes the following approach. The search-space is split according to the variable ordering of the prefix until a given depth. Once one of the sub-trees of an existentially quantified variable split is found to be true, the other sibling is terminated. Only when both siblings return false does the whole split return false. Universal splits work in a dual manner: the result is only true if both sub-trees are found to be true and false otherwise. This property of QBF enables efficient termination of sub-tasks.
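The termination rule for splits can be stated as two tiny functions, one combining sub-tree results per quantifier type and one deciding when a single branch already settles the split (a sketch of the rule just described, not ParaQooba's actual code):

```python
def assess(quantifier, sub_results):
    """Combine the results of a split's sub-trees: an existential
    split is true if some branch is true, a universal split only
    if all branches are true."""
    return any(sub_results) if quantifier == 'E' else all(sub_results)

def decides(quantifier, result):
    """True iff one branch's result already determines the split,
    so the sibling can be terminated early."""
    return result if quantifier == 'E' else not result
```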

In ParaQooba, we now also parallelize each solver call over several QBF solvers with orthogonal strategies. Compared to prior approaches [18], we run a portfolio of multiple solvers in the leaves of the solve tree instead of only parallelizing its root. Having just one tree leads to several advantages: we are more flexible and may also call a preprocessor (e.g., Bloqqer) before each solve call. We also instantiate the tree only once, saving memory and enabling early termination of sibling solver tasks.

## 6 Implementation

This section describes the extension of the SAT solver ParaCooba (for an overview see section 4) to our QBF solving framework ParaQooba. As ParaCooba was originally not designed for portfolio support, several modifications and extensions were necessary. To this end, we first present the new QBF module of ParaQooba, followed by a discussion of novel search-space pruning facilities.

Fig. 1: Divide-and-Conquer with arbitrarily many levels of splitting and sub-formulas at the leaves, solved by a portfolio of different sequential solvers

## 6.1 The ParaQooba QBF Module

We generalized the already existing QBF solver handle to become an abstract base class, which can now be either a single solver handle or a *portfolio handle*. The latter unifies multiple handles into one, emulating a blocking and re-entrant interface. Once a portfolio handle is initialized, it starts one thread per internally wrapped handle. Each such thread implements a small state machine, waiting for events on a shared queue. Once the portfolio handle receives an assumption (a temporary truth assignment of a variable for one solver call), the assumption is forwarded to all internal threads and worked on by each wrapped solver in parallel.

If a portfolio handle were terminated before a solve call was issued, the internal handles would enter an invalid state. To circumvent this situation, an assumption event also directly triggers the internal state machine to continue into the solve state. Once the solve request actually arrives, it is translated into an empty event which, after it has finished processing, indicates that a result was computed. A termination event is forwarded to the internal solver handles, but is limited to one event per solve cycle.

Fig. 2: The ParaQooba framework

The first internal solver handle to compute a result returns and sends a termination event to all sibling solvers. The result is saved and the portfolio handle waits for all internal handles to be ready to receive the next assumption, i.e., for all solvers to return to a known state. Once every internal handle has reached that state, the portfolio handle returns to its calling thread, forwarding the result of the inner handle. Because of thread scheduling and fast solving of trivial subproblems, a result can be forwarded even before the other sibling has been started, letting the broker module complete a task before it has created both child tasks. This effect led to some issues and had to be mitigated by adding checks for a task having already been terminated even though it did not yet run to completion. Because a task is only scheduled after the initial call to its assessment function, few such checks were needed.
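The first-result-wins behaviour of a portfolio handle can be sketched in a few lines. The following is a simplified Python model with a hypothetical solver-callable interface; the actual implementation is C++ with per-handle state machines and a shared event queue.

```python
import threading

class PortfolioHandle:
    """Minimal first-result-wins portfolio (illustrative sketch only)."""

    def __init__(self, solvers):
        # each solver is a callable: (assumptions, stop_event) -> True/False/None
        self.solvers = solvers

    def solve(self, assumptions):
        stop = threading.Event()      # models the termination event
        result = []
        lock = threading.Lock()

        def run(solver):
            r = solver(assumptions, stop)
            if r is not None:         # None models an aborted/Unknown result
                with lock:
                    if not result:    # first valid result wins
                        result.append(r)
                        stop.set()    # terminate sibling solvers

        threads = [threading.Thread(target=run, args=(s,)) for s in self.solvers]
        for t in threads:
            t.start()
        for t in threads:             # wait until every solver is back in a
            t.join()                  # known state before returning
        return result[0] if result else None
```

A wrapped solver is expected to observe the stop event and return None when aborted, mirroring how termination events are forwarded to the internal solver handles.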

As many QBF solvers lack APIs, we have to work with their binaries, which generally only read QDIMACS files. For this, we use the QuAPI interfacing library, which adds well-performing assumption-based reasoning support to generic solver binaries [9]. By not relying on specialized modifications of a solver's source code, we are able to plug in generic third-party solvers, completely composable at runtime. Our ParaQooba module provides the --quapisolver parameter, which either directly specifies the leaf solver to be used, or automatically generates a portfolio handle to wrap multiple parallel leaf solvers. Note that our approach works for QBFs starting with existential as well as with universal quantification.
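To illustrate the idea behind assumption-based solving with plain binaries, the sketch below renders a QDIMACS instance in which assumptions on variables of the leading existential block are added as unit clauses. This is a simplified stand-in for QuAPI's actual interception mechanism, and the helper name is ours.

```python
def qdimacs_with_assumptions(num_vars, prefix_lines, clauses, assumptions):
    """Serialize a PCNF to QDIMACS text, encoding each assumption literal
    as an extra unit clause. Sound only for variables of the outermost
    existential block; QuAPI handles the general case."""
    all_clauses = clauses + [[lit] for lit in assumptions]
    lines = ["p cnf %d %d" % (num_vars, len(all_clauses))]
    lines += prefix_lines                      # e.g. "e 1 2 0", "a 3 0"
    lines += [" ".join(str(l) for l in c) + " 0" for c in all_clauses]
    return "\n".join(lines) + "\n"
```

The resulting text can then be piped into a solver binary's standard input, e.g. with `subprocess.run([solver_binary], input=text.encode())`.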

In its standard configuration, ParaQooba returns whether a given instance is found to be true or false. When trace output is enabled using -t, it also prints the specific solver and the subproblem (including its guiding path) that produced a result. Using this machinery, one obtains an environment to experiment with benchmarks and to see how multiple solvers complement each other on the generated sub-formulas. The trace output is also useful when fully expanding a QBF formula by specifying a tree-depth of -1. While not advisable for real formulas, this was a well-received debugging aid for stress-testing new features. The opposite can also be done by applying a tree-depth of 0, which directly solves the root task without splitting the formula. This is how the configuration PQ Portfolio with depth 0 (discussed in the experimental evaluation below) was executed.

### 6.2 Search-Space Pruning

*Preprocessing in the leaves.* We modified the QBF preprocessor Bloqqer to allow forwarding output directly into a given solver binary by adding a -p argument. Internally, this writes the complete formula with added assumptions into the standard input of Bloqqer's preprocessing pipeline.

To plug, e.g., Caqe into such a processing chain and then into ParaQooba, one may use our QBF solver module's command line option --quapisolver bloqqer-popen@-p=caqe. Deferring preprocessing until the leaves are solved preserves the original structure of the formula during the split phase. We discuss the effects of this later in subsection 7.4.

*Integer-Split Reduction.* In many planning and verification encodings, the variables of a quantifier block QX are interpreted as bitvectors representing m nodes of a graph. Assume that n = |X| bits with m ≤ 2<sup>n</sup> are used for modeling the states of the graph. Then 2<sup>n</sup> − m assignments to X are not relevant, but as a solver is agnostic of this information, it has to consider all assignments.

If m is known to the user, ParaQooba can be called with the option --intsplit (once or multiple times, once for each layer). One integer-split is counted as one layer in the task tree, so a tree-depth of two would split another quantifier into two more tasks for each state encoded in the previous integer-based split. To provide an example: setting --intsplit 5 creates 5 child tasks in the task tree, spanning over the first ⌈log<sub>2</sub> 5⌉ = 3 Boolean variables from the quantifier prefix. Without an integer-based split, these 3 variables would have to be expanded over 3 layers in the task tree, each inner task being split into two child tasks, resulting in 8 leaves as opposed to the 5 from before. Thus, integer-based splits require fewer intermediate splitting tasks to model the same formula, reducing the work to be done by the load-balancing mechanism in the Broker module. These integer splits are efficiently distributed over the network by relying on both the config-system and an extended QBF cube source. The cube source always saves the current guiding path, applying new splits, and in turn new assumptions, by appending to that path. The cube source itself is automatically serialized when a task is chosen to be offloaded to another compute node. While the possible savings are large, one has to exert great caution when using this feature, as it might change the semantics of a formula.
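The arithmetic behind this example can be made explicit with a small illustrative helper (not part of ParaQooba's interface):

```python
import math

def split_cost(m):
    """Compare an integer split over m encoded states with the full binary
    expansion of the ceil(log2(m)) Boolean variables that encode them."""
    bits = max(1, math.ceil(math.log2(m)))
    return {"bits": bits,
            "integer_split_tasks": m,          # one child task per valid state
            "binary_split_leaves": 2 ** bits}  # 2^bits leaves, some irrelevant
```

For m = 5 this yields 5 tasks instead of 8 leaves; the gap grows with the number of irrelevant assignments 2<sup>n</sup> − m.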

## 7 Evaluation

In this section, we evaluate ParaQooba on recent benchmarks and compare it to (sequential) state-of-the-art QBF solvers. As sequential backend solvers, we use the latest versions of DepQBF [17] as QCDCL solver, Caqe [23] as clausal-abstraction solver, and RaReQs [13] as recursive abstraction refinement solver. For preprocessing, we use Bloqqer [3] (version 31). All of these solvers were top-ranked in the most recent edition of QBFEval'22 [22]. For our experiments, we used the benchmarks of the PCNF track of this competition. The main questions we want to answer with our evaluation are as follows:


We ran our experiments on machines with dual-socket 16-core AMD EPYC 7313 processors with 3.7 GHz sustained boost clock speed and 256 GB main memory. Each task was assigned as many physical cores as its setup required, except for tasks with more than 32 concurrent threads, which were exclusively assigned a whole node each so as not to be slowed down by other loads. The effects of over-committing in the case of three concurrent portfolio solvers (48 threads running in parallel with only 32 physical cores available) are discussed below in subsection 7.3.

Fig. 3: Full summary of all solved instances with all different solvers without preprocessing. While Divide-and-Conquer (Depth 4) solves 33 instances that no sequential solver solved, it solves 28 fewer instances in total.

Fig. 4: Full summary of all solved instances with all different solvers with Bloqqer preprocessing. PQ Portfolio (Depth 4) solves 45 instances no sequential solver could solve and solves 3 more in total.

Please note that in this evaluation we do not use the networking features provided by ParaCooba, as we focus on applicability to QBF and not on the already presented scalability of the networking component (for the details see [3]).

#### 7.1 Overall Performance Comparison

In order to exploit our hardware with 32 physical cores and 64 logical cores in the best possible way, we mainly focus on a *splitting depth of four* in the following. With this depth, 16 worker threads are generated for each problem and with three sequential backend solvers, overall 48 processes are started. We call this configuration *PQ Portfolio, Depth 4*. For understanding the impact of splitting, we also consider other depths as well. With *PQ Portfolio, Depth 0* we refer to the configuration in which splitting is disabled. This configuration is particularly interesting, because compared to the virtual best solver (VBS), it reveals the overhead introduced by our framework (see also the discussion below). In order to show the improvements of ParaQooba compared to the QBF module without portfolio solving that was already available in ParaCooba [6], we also included the configuration *PQ DepQBF, Depth 4*.

Figure 3 shows the overall results of our evaluation *without preprocessing*. Both configurations of ParaQooba, *PQ Portfolio, Depth 0* and *PQ Portfolio, Depth 4*, are considerably better than the single sequential solvers as well as the basic non-portfolio QBF module of ParaCooba solving only with DepQBF (PQ DepQBF, Depth 4). However, compared to the virtual portfolio, 28 fewer instances are solved in total (for an explanation see below). On the positive side, 33 formulas can be solved by our new approach that could not be solved by any sequential solver. The situation changes when preprocessing is applied (cf. Figure 4). Now ParaQooba in configuration *PQ Portfolio Preprocessed Formulas, Depth 4* is able to solve the most formulas. It even solves more formulas than the *Preprocessed Virtual Portfolio*, indicating the potential of our approach.

A detailed analysis is given in Figure 5. Comparing the number of solved instances to the solve time of individual (preprocessed) problem instances, we see a small average speedup when using ParaQooba with depth 4 compared to a virtual portfolio solver in Figure 5a. The more trivial instances tend to be solved quicker by a sequential solver, while the harder-to-solve instances tend to be solved faster with the Divide-and-Conquer approach of ParaQooba.

Next, we used the preprocessed leaves functionality introduced in subsection 6.2. Here ParaQooba generates its guiding paths using the original formula and applies Bloqqer only in the leaves of the solve tree. In this configuration, some problem instances take longer to solve than when preprocessing the full formula, while others can be solved quicker. We present these results in Figure 5b. Such a result was expected, as it is conceptually similar to inprocessing.

(c) ParaQooba: preprocessing of leaf formulas compared to preprocessing of input formula

(d) ParaQooba with Depth 4 compared to Virtual Portfolio on Hex formulas

Fig. 5: Detailed comparison of ParaQooba against the virtual portfolio of DepQBF, Caqe, and RaReQs in (a), (b), and (d). In (a), ParaQooba solves 45 instances that no sequential solver could solve. In (b), ParaQooba solves 38 instances no sequential solver could solve, 8 of which also could not be solved with the portfolio over preprocessed formulas as in (a). (d) focuses only on preprocessed formulas from the Hex benchmark family. In (c), we directly compare preprocessing in the leaves to preprocessing of the input formula.

Fig. 6: Preprocessed formulas of the Hex positional game planning [20,25] benchmarks from the QBF22 benchmark set. Also compared to HordeQBF [1] as available state-of-the-art parallel QBF solver.

When considering the formulas that were exclusively solved by ParaQooba, the variant preprocessing the full formula up-front performed best, followed by the variant preprocessing in the leaves. These formulas include verification and synthesis benchmarks with 2–3 quantifier alternations as well as many encodings of the game Hex with 13, 15, or 17 quantifier alternations. Table 1 in the appendix lists all 48 instances that were only solved by some variant of ParaQooba, and also indicates which variant was the fastest.

#### 7.2 Family-Based Analysis

To understand which formula families benefit most from our Divide-and-Conquer solving strategy, we compared the (wall-clock) solve time of ParaQooba to the virtual portfolio solver. We calculated the speedup by dividing the solve time of the virtual portfolio solver by the solve time of ParaQooba. The instances with the highest speedups were some reachability queries (up to 18.09), the Hex game planning family (17.64), multipliers (16.46), and the formula\_add family (15.16). More detailed results are given in Table 2. Together with the number of Hex instances only ParaQooba solved (21), this makes Hex game planning the benchmark family with the best overall results in our evaluation. A comparison between ParaQooba and other solvers is shown in Figure 6.

#### 7.3 Scalability of our Approach

As already discussed above, using 16 workers leads to over-committing cores when solving with a portfolio of more than two solvers. To quantify this, we performed a scalability experiment with different worker counts. Because the Hex planning benchmarks showed the most predictable performance, we focused this experiment on these formulas. Figure 7 shows the scalability graph, where the X-axis has been multiplied by the number of workers used, to visualize the cost of increased CPU time compared to reduced wall-clock solve time. The impact of over-committing CPU cores can be clearly observed in the results of the portfolio with depth 4: this configuration solves more instances than the others but takes longer to solve the first 140 instances, until the curves become more similar again.

Fig. 7: Hex scalability with preprocessed formulas. Depth 4 suffers from over-committing the available CPU cores on our hardware and is relatively slow for the first few problems, but still solves more instances overall.

#### 7.4 Preprocessed Leaves compared to Preprocessed Formulas

We compared preprocessing the whole formula at once using Bloqqer to calling Bloqqer via bloqqer-popen in each leaf after first splitting on the unchanged formula. The first variant modifies the original prefix, including the quantifier ordering. Because the splitting algorithm generates guiding paths by following this quantifier ordering, the two approaches lead to vastly different results. Figure 5c visualizes these differences in a scatter plot of both variants.

Looking at the specific benchmarks benefiting from the two variants, we observed that most families benefit predominantly from one of the two. This strongly suggests that adaptive preprocessing and inprocessing techniques could further improve solving performance, even without otherwise changing the solvers themselves.

#### 7.5 Lessons Learned

One would expect that for any given problem, a parallel portfolio solver is as fast as the fastest solver it uses. While this statement is conceptually true, we encountered some formulas where PQ Portfolio gave comparatively bad results, while a solver alone could solve the same formula quicker or even instantly. We investigated this in more detail and found several segmentation faults in Caqe and API inconsistencies in DepQBF that were triggered by corner-case structures of the generated subproblems (e.g., by enforcing the values of certain variables). We reported these issues to the solver developers and hope to obtain fixes soon. Having these issues fixed would lead to a more performant general solution and a more robust user experience. In sequential execution of these solvers, we did not encounter any problems on the unmodified competition benchmarks without added unit clauses.

Currently, we adopt the following work-around. Segmentation faults of the sequential solvers are handled in our QBF module using the indirection provided by QuAPI. Once an unrecoverable error occurs in the solver child process, it exits and returns the error up through QuAPI's factory process and into the solver handle. There, such a result is interpreted as *Unknown*, which is invalid and therefore ignored, letting the portfolio wait for other results. We provide all affected formulas that we found in the artifact submitted alongside this paper.
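The work-around boils down to how a solver child process's exit status is mapped to a result. A sketch of this mapping, assuming the common (Q)DIMACS exit-code conventions of 10 for true and 20 for false:

```python
def interpret_exit(status):
    """Map a solver child process's exit status to a portfolio result.

    10 -> formula true, 20 -> formula false (the (Q)DIMACS conventions);
    anything else, including a crash, yields None, i.e. Unknown, which
    the portfolio ignores while waiting for a sibling's result.
    """
    return {10: True, 20: False}.get(status)
```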

We also observed that calling a solver via its API might lead to a considerably different behavior than calling a solver from the command line, i.e., different optimizations are activated when calling a solver through its API compared to using the command-line binary. Such behavior can be mitigated by not using the API directly, and instead relying on QuAPI, even if an API would be available. This fixes the issues with DepQBF, which solves some formulas (with assumptions supplied as unit clauses) in under one second if used as a solver binary, but not when applying assumptions through its API. We also supply all found formulas that triggered this issue in the submitted artifact.

## 8 Conclusions

We presented ParaQooba, a parallel and distributed QBF solving framework that combines search-space splitting with portfolio solving. We designed the framework in such a way that any sequential QBF solver binary can be easily integrated without any implementation effort. Our experiments demonstrate that this approach, in combination with sequential preprocessing, leads to considerable performance improvements for certain formula families.

With our framework, we provide a stable infrastructure that has the potential for many future extensions. For example, we did not incorporate any advanced splitting heuristics as in modern Cube-and-Conquer solvers. We expect that with more advanced heuristics, combined with adaptive but possibly non-deterministic re-splitting of leaves, even more speedups could be achieved.

In addition to the presented experiments, we also evaluated the novel integer-split feature (cf. subsection 6.2) with the Hex benchmark family. By providing the number of valid game states to ParaQooba, we could increase the splitting depth as well as the number of solved instances. We see much potential in providing encoding-specific or domain-specific knowledge to the solver and will investigate this in future work.

## Data Availability Statement

Data used for benchmarking the described software, including source code, are made available permanently under a permissive license in a public artifact on Zenodo. Raw source data for the figures presented in this paper are also included [8].

## A Instances Only Solved by ParaQooba



Table 1: 48 instances that were only solved by a ParaQooba configuration. QA: Quantifier Alternations, Res: Result, Variant: ParaQooba configuration that solved the problem the fastest (preprocess full formula, preprocess leaves, original formula).

## B Instances Solved faster by ParaQooba


Table 2: Instances that ParaQooba (PQ) solved faster compared to a virtual portfolio solver (VPS) that also solved the same problem, ordered by the relative speedup and limited to the top 25 entries. Res: Result, Speedup: VPS[s] / PQ[s].

## References

1. Balyo, T., Lonsing, F.: HordeQBF: A modular and massively parallel QBF solver. In: Creignou, N., Berre, D.L. (eds.) Proc. of the 19th Int. Conf. on Theory and Applications of Satisfiability Testing (SAT). Lecture Notes in Computer Science, vol. 9710, pp. 531–538. Springer (2016). https://doi.org/10.1007/978-3-319-40970-2_33


the 15th Int. Conf. on Theory and Applications of Satisfiability Testing (SAT). Lecture Notes in Computer Science, vol. 7317, pp. 114–128. Springer (2012). https://doi.org/10.1007/978-3-642-31612-8_10



## Inferring Needless Write Memory Accesses on Ethereum Bytecode<sup>⋆</sup>

Elvira Albert<sup>1</sup>, Jesús Correas<sup>1</sup>, Pablo Gordillo<sup>1</sup>(✉), Guillermo Román-Díez<sup>2</sup>, and Albert Rubio<sup>1</sup>

<sup>1</sup> Complutense University of Madrid, Madrid, Spain pabgordi@ucm.es

<sup>2</sup> Universidad Politécnica de Madrid, Madrid, Spain

Abstract. Efficiency is a fundamental property of any type of program, but it is even more so in the context of the programs executing on the blockchain (known as smart contracts). This is because optimizing smart contracts has direct consequences on reducing the costs of deploying and executing the contracts, as there are fees to pay related to their bytes-size and to their resource consumption (called gas). Optimizing memory usage is considered a challenging problem that, among other things, requires a precise inference of the memory locations being accessed. This is also the case for the Ethereum Virtual Machine (EVM) bytecode generated by the most-widely used compiler, solc, whose rather unconventional and low-level memory usage challenges automated reasoning. This paper presents a static analysis, developed at the level of the EVM bytecode generated by solc, that infers write memory accesses that are needless and thus can be safely removed. The application of our implementation on more than 19,000 real smart contracts has detected about 6,200 needless write accesses in less than 4 hours. Interestingly, many of these writes were involved in memory usage patterns generated by solc that can be greatly optimized by removing entire blocks of bytecodes. To the best of our knowledge, existing optimization tools cannot infer such needless write accesses, and hence cannot detect these inefficiencies that affect both the deployment and the execution costs of Ethereum smart contracts.

## 1 Introduction

EVM and memory model. Ethereum [27] is considered the world-leading programmable blockchain today. It provides a virtual machine, named EVM (Ethereum Virtual Machine) [21], to execute the programs that run on the blockchain. Such programs, known as Ethereum "smart contracts", can be written in high-level programming languages such as Solidity [6], Vyper [4], Serpent [3] or Bamboo [1] and they are then compiled to EVM bytecode. The EVM bytecode is the code finally deployed in the blockchain, and has become a uniform format to develop analysis and optimization tools. The memory model of EVM programs has been described in previous work [17,19,26,27]. Mainly, there are three

© The Author(s) 2023

<sup>⋆</sup> This work was funded partially by the Spanish MCIU, AEI and FEDER (EU) projects PID2021-122830OB-C41 and PID2021-122830OA-C44 and by the CM project S2018/TCS-4314 co-funded by EIE Funds of the European Union.

S. Sankaranarayanan and N. Sharygina (Eds.): TACAS 2023, LNCS 13993, pp. 448–466, 2023. https://doi.org/10.1007/978-3-031-30823-9_23

regions in which data can be stored and accessed: (1) The EVM is a stack-based virtual machine, meaning that most instructions perform computations using the topmost elements in a machine stack. This memory region can only hold a limited amount of values, up to 1024 256-bit words. (2) EVM programs store data persistently using a memory region named storage that consists of a mapping of 256-bit addresses to 256-bit words and whose contents persist between external function calls. (3) The third memory region is a local volatile memory area that we will refer to as EVM memory, and which is the focus of our work. This memory area behaves as a simple word-addressed array of bytes that can be accessed by byte or as a one-word group. The EVM memory can be used to allocate dynamic local data (such as arrays or structs) and also for specific EVM bytecode instructions which have been designed to require some lengthy operands to be stored in local memory. This is the case of the instructions for computing cryptographic hashes, or for passing arguments to and returning data from external function calls. Compilers use the stack and volatile memory regions in different ways. The most-used Solidity compiler solc generates EVM code that uses the stack for storing value-type local variables, as well as intermediate values for complex computations and jump addresses, whereas reference-type local variables such as array types and user-defined struct types are located in memory. For instance, when a Solidity function returns a struct variable, the required memory for the struct is allocated and initialized at the beginning of the function execution. However, the allocated memory is not always accessed as we illustrate in the following function (that belongs to the contract in Fig. 1):

```
1 function _ownershipAt(uint256 i) private returns (TokenOwnership memory) {
2     return c.unpackedOwnership(packedOwnerships[i]);
3 }
```
Although the execution of \_ownershipAt allocates memory for the return value declared in the function definition, the execution of the function reserves a different memory space for the actual returned struct obtained from unpackedOwnership and, thus, the first reservation and its initialization are needless. The focus of our work is on detecting such needless write memory accesses in the code generated by solc. Nevertheless, as the analysis works at the EVM level, it could easily be adapted to EVM code generated by any other compiler.

Optimization. Optimization of Ethereum smart contracts is a hot research topic, see e.g. [9, 10, 12–14, 22, 24] and their references. This is because the reduction of their costs is relevant for three reasons: (1) Deployment fees. When the contract is deployed on the blockchain, the owner pays a fee related to the size in bytes of the bytecode. Hence, a clear optimization criterion is the bytes-size of the program. The Solidity compiler solc [6] has as optimization target such bytes-size reduction. (2) Gas-metered execution. There is a fee to be paid by each client to execute a transaction in the blockchain. This fee is a fixed amount per transaction plus the cost of executing all bytecode instructions within the function being invoked within the transaction. This cost is measured in "gas" (which is then priced in the corresponding cryptocurrency) and this is why the execution is said to be gas-metered. The EVM specification ([27] and more recent updates)

provides a precise gas consumption for each bytecode instruction in the language. The goal of most EVM bytecode optimization tools [9, 10, 12–14, 22] is to reduce such gas consumption, as this in turn reduces the price of all transactions on the smart contract. (3) Enlarging Ethereum's capability. Due to the huge volume of transactions being demanded, there is great interest in enlarging the capability of the Ethereum network to increase the number of transactions that can be handled. Optimization of EVM bytecode in general, and of its memory usage in particular, is an important step in this direction.

Challenges and contributions. Optimizing memory usage is considered a challenging problem that requires a precise inference of the memory locations being accessed, and that usually varies according to the memory model of the language being analyzed and to the compiler that generates the code to be executed. In the case of Ethereum smart contracts generated by the solc compiler, the memory model is rather unconventional and its low-level memory usage patterns challenge automated reasoning. On the one hand, instead of having an instruction to allocate memory, the allocation is performed by a sequence of instructions that use the value stored at address 0x40 as the free memory pointer, i.e., a pointer to the first memory address available for allocating new memory. In the general case, the memory is structured as a sequence of slots: a slot is composed of several consecutive memory locations that are accessed in the bytecode from the same initial memory location plus a corresponding offset. A slot might simply hold a data structure created in the smart contract, but when nested data structures are used, a slot can also contain pointers to other memory slots for the nested components. Finally, there are other types of transient slots that hold temporary data and that need to be captured by a precise memory analysis as well.
These features pose the main challenges to infer needless write accesses and, to handle them accurately, we make the following main contributions: (1) we present a slot analysis to (over-)approximate the slots created along the execution and the program points at which they are allocated; (2) we then introduce a slot usage analysis which infers the accesses to the different slots from the bytecode instructions; (3) we finally infer needless write accesses, i.e., program points where the memory is written but never read by any subsequent instruction of the program; and (4) we implement the approach and perform a thorough experimental evaluation on real smart contracts, detecting needless write accesses that belong to highly optimizable memory usage patterns generated by solc. Finally, it is worth mentioning that the applications of the memory analysis (points 1 and 2) go beyond the detection of needless write accesses: a precise model of the EVM memory is crucial to enhance the accuracy of any subsequent analysis (see, e.g., [19] for other concrete applications of a memory analysis).

## 2 Memory Layout and Motivating Examples

Memory Opcodes. The EVM instruction set contains the usual instructions to access memory: the most basic instructions that operate on memory are MLOAD

```
 4  struct TokenOwnership {
 5      address addr;
 6      uint64 startTs;
 7      bool burned;
 8  }
 9
10  contract Running1 {
11      // ...
12      function unpackedOwnership
13          (uint256 packed) public
14          /*s1,s2*/ returns (TokenOwnership
                memory ownership) {
15          ownership.addr = ...;
16          ownership.startTs = ...;
17          ownership.burned = ...;
18      }
19  }

17  contract Running2 {
18      Running1 c;
19      mapping(uint256=>uint256) private packedOwnerships;
20      // ...
21      function ownershipAt(uint256 i) private
22          /*s6*/ returns (TokenOwnership memory) {
23          /*s7*/ return c.unpackedOwnership(packedOwnerships[i]);
24      }
25      function explicitOwnershipOf(uint256 tokenId)
26          /*s3*/ public returns (TokenOwnership memory) {
27          /*s4*/ TokenOwnership memory ownership;
28          /*s5*/ if (...) { return ownership; }
29          /*s8*/ ownership = ownershipAt(tokenId);
30          // ...
31          /*s5*/ return ownership;
32      }
33  }
```
Fig. 1: Excerpt of smart contract ERC721A.

and MSTORE, which load and store a 32-byte word from memory, respectively.<sup>3</sup> The solc compiler generates code that handles memory with a cumulative model in which memory is allocated along the execution of the program and is never released. In contrast to other bytecode virtual machines, like the Java Virtual Machine, the EVM does not have a dedicated instruction to allocate memory. The allocation is performed by a sequence of instructions that use the value stored at address 0x40 as the free memory pointer, i.e., a pointer to the first memory address available for allocating new memory. In what follows, we use mem⟨x⟩ to refer to the content stored in memory at location x.

Memory Slots. In the general case, memory is structured as a sequence of slots. A slot is composed of consecutive memory locations that are accessed by using its initial memory location, which we call the base reference (baseref for short) of the slot, plus the corresponding offset needed to access a specific location within the slot. Slots usually store (part of) some data structure created in the Solidity program (e.g., an array or a struct) and whose length can be known.

Example 1 (slots). Fig. 1 shows an excerpt of smart contract ERC721A [2] which contains two different contracts, Running1 and Running2. We have omitted non-relevant instructions such as those that appear at lines 15-17 (L15-L17 for short). The contract Running1, to the left of Fig. 1, contains the public function unpackedOwnership that returns a struct of type TokenOwnership defined at L4-L7. The contract Running2, shown to the right, contains the public function explicitOwnershipOf that returns, depending on a non-relevant condition, an empty struct of type TokenOwnership (L29) or the TokenOwnership received from a call to function unpackedOwnership of contract Running1 (L23), which is done in the private function \_ownershipAt. The execution of function unpackedOwnership in Running1 allocates two different memory slots at L13: s1, for the returned variable ownership, and s2, which is used for actually returning the contents of ownership from the function.

<sup>3</sup> Although the local memory is byte addressable with instruction MSTORE8, to keep the description simpler, we only consider the general case of word-addressable MSTORE.

The function explicitOwnershipOf in Running2 makes a more intensive use of the memory.

The execution of this function might create up to six different slots. At L27 and L28, it creates two slots, one for the struct declared in the returns part of the function header (s3) and one for the local variable ownership (s4). Depending on the evaluation of the condition in the if statement, it might create the slots needed to perform the call to \_ownershipAt and, consequently, the external call to Running1.unpackedOwnership. The invocation of the private function involves three slots: one for the struct declared in the returns part of \_ownershipAt in L31 (s6), one slot to manage the external call data in L23 (s7), and one slot for storing the results of the private function \_ownershipAt in L31 (s8). Finally, a new slot (s5) is created for returning the results of explicitOwnershipOf. This new slot might contain the contents of s4 or s8, depending on the if evaluation.

When an amount of memory t is to be allocated, the slot reservation is made by reading the free memory pointer (mem⟨0x40⟩) and incrementing it by t positions. From this update on, the base reference to the slot just allocated is used, and subsequent accesses to the slot are performed by means of this baseref, possibly incremented by an offset.
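This reservation pattern can be mimicked over a toy word-level memory model. The dictionary-based `mem` and the helper `allocate_slot` below are our illustration, not part of the paper's tool; the constant 0x80 is the initial value solc assigns to the free memory pointer:

```python
FREE_PTR = 0x40  # address where the free memory pointer is stored

def allocate_slot(mem: dict, size: int) -> int:
    """Reserve `size` bytes following the solc cumulative pattern:
    (i) read mem<0x40> to obtain the baseref of the new slot,
    (ii) compute baseref + size,
    (iii) store the result back as the new free memory pointer."""
    baseref = mem.get(FREE_PTR, 0x80)  # solc initializes the pointer to 0x80
    mem[FREE_PTR] = baseref + size
    return baseref

mem = {}
s1 = allocate_slot(mem, 0x60)  # e.g. a TokenOwnership struct: 3 words
s2 = allocate_slot(mem, 0x60)  # slots are consecutive and never released
```

Note that memory only grows: there is no deallocation counterpart in this model, matching the EVM's cumulative memory.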

Example 2 (memory slot reservation). The following excerpt of EVM code allocates a slot of type TokenOwnership. The EVM bytecode performs three steps:

(i) load the current value of the free memory pointer mem⟨0x40⟩, which will be used as the baseref of the new slot; (ii) compute the new free memory address by adding t to the baseref; and (iii) store the new free memory pointer in mem⟨0x40⟩. Additionally, in the same block of the CFG, the slot reservation is followed by the slot initialization at 0x19A, 0x1AB and 0x1B4.


Solidity reference type values such as arrays, struct typed variables and strings are stored in memory using this general pattern, with some minor differences. However, there are cases in which the steps detailed above vary and the size of the slot is not known in advance, so the free memory pointer cannot be updated at this point. For instance, when data is returned by an external call, its length is unknown beforehand and hence the free memory pointer is updated only after the memory pointed to is written. In other cases, the free memory is used as a temporary region with a short lifetime, as in the case of parameter passing to external calls, and the free memory pointer is not updated. These variants of the general schema must be detected by a precise memory analysis. To this end, we consider that a slot is in transient state when its baseref has been read from mem⟨0x40⟩ but the free memory pointer has not been updated, and it is in permanent state when the free memory pointer has been pushed forward.

Example 3 (transient slot). We now focus on the external call in L23 of Running2, which performs a STATICCALL, reading from the stack (see [27] for details) the memory location of the input arguments and the location where the results of the call will be saved. Interestingly, both locations reuse the same slot (it corresponds to s7), as can be seen in the following EVM bytecode from \_ownershipAt:


The call starts by reading the free memory pointer (at 0x114) and storing at that address the arguments' data (which include the function selector as first argument). Importantly, the pointer is not pushed forward when the input arguments are written and thus the slot remains in transient state. Once the call at 0x139 is executed, the result is written to memory from the baseref on (overwriting the locations used for the input arguments) and the slot is finally made permanent by reading the free memory pointer again (0x151) and updating it (0x160) by adding the actual return data size (RETURNDATASIZE).

Transient slots are also used when returning data from a public function to an external caller. In that case, the EVM code of the public function halts its execution using a RETURN instruction. It reads from the stack the memory location where the length and the data to be returned are located. However, it does not change mem⟨0x40⟩ because the function code halts its execution at this point, as we can see in the EVM code of explicitOwnershipOf (it corresponds to slot s5):


The baseref for the return slot is read (at 0x4D) and it is used as a transient slot to write the struct contents to be returned by adding the corresponding offset for each field contained in the struct (instructions on the left column). The code on the left ends with the baseref plus the size of the stored data on top of the stack. After that, the baseref is read again (top of the right column) and the length of the returned data is computed (by subtracting the baseref from the baseref plus the size of the stored data) before calling the RETURN instruction.

## 3 Inference of Needless Write Accesses

This section presents our static inference of needless write accesses. We first provide some background in Sec. 3.1 on the type of control-flow-graph (CFG) and static analysis we rely upon. Then, the analysis is divided into three consecutive steps: (1) the slot analysis, which is introduced in Sec. 3.2, to identify the slots created along the execution and the program points at which they are allocated; (2) the slot usage analysis, presented in Sec. 3.3, which computes the read and write accesses to the different slots identified in the previous step; and (3) the detection of needless write accesses, given in Sec. 3.4, which finds those program points where there is a write access to a slot which has no read access later on.

## 3.1 Context-Sensitive CFG and Flow-Sensitive Static Analysis

The construction of the CFG of Ethereum smart contracts is a key part of any decompiler and static analysis tool and has been the subject of previous research [15, 16, 25]. The more precise the CFG is, the more accurate our analysis results will be. In particular, context-sensitivity [16] in the CFG construction is vital to achieve precise results. Our implementation of context-sensitivity is realized by cloning the blocks which are reached from different contexts.

Example 4 (context-sensitive CFG). The EVM code of Running2 creates multiple slots for handling structs of type TokenOwnership. Interestingly, all these slots are created by means of the same EVM code shown in Ex. 2, which corresponds to the CFG block that starts at program point 0x175. As this block is reached from different contexts, the context-sensitive CFG contains three clones of this block: 0x175, which creates s3 at L27; 0x175\_0, which creates s4 used at L28; and 0x175\_1, which reserves s6, created at L22. Block cloning means that program points are cloned as well, and we adopt the same subindex notation to refer to the program points included in the cloned blocks: e.g., program point 0x178 contains the MLOAD 0x40 that gets the baseref of the slot reserved in block 0x175, while 0x178\_0 refers to the same MLOAD but in block 0x175\_0, etc.

In what follows, we assume that cloning has been made and that the memory analysis using the resulting CFG (with clones) is thus context-sensitive as well, without requiring additional extensions. As usual in standard analyses [23], one has to define the notion of abstract state, which captures the information gathered by the analysis, and the transfer function, which models the analysis output for each possible input. Besides context-sensitivity, the two analyses that we present in the next two sections are flow-sensitive, i.e., they make a flow-sensitive traversal of the CFG of the program, using as input for analyzing each block of the CFG the information inferred for its callers. When the analysis reaches a CFG block with new information, we use the operation ⊔ to join the two abstract states, and the operator ⊑ to detect that a fixpoint has been reached and, thus, that the analysis terminates. The operations ⊔ and ⊑, the abstract state, and the transfer function will be defined for each particular analysis.
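Such a flow-sensitive traversal can be sketched as a generic worklist fixpoint computation. The CFG representation (`{block: (instructions, successors)}`) and all names below are our illustration, not the paper's implementation:

```python
def analyze(cfg, entry, init, transfer, join, leq):
    """Generic flow-sensitive fixpoint over a CFG.
    `transfer` processes one instruction, `join` plays the role of
    the join ⊔, and `leq` the role of the ordering ⊑."""
    state_in = {entry: init}
    work = [entry]
    while work:
        b = work.pop()
        insns, succs = cfg[b]
        s = state_in[b]
        for i in insns:           # propagate through the block
            s = transfer(i, s)
        for nxt in succs:         # push the result to the successors
            old = state_in.get(nxt)
            new = s if old is None else join(old, s)
            if old is None or not leq(new, old):
                state_in[nxt] = new
                work.append(nxt)  # re-analyze until a fixpoint
    return state_in

# toy instance: collect the instructions that may reach each block
cfg = {"A": (["x"], ["B"]), "B": (["y"], ["B", "C"]), "C": ([], [])}
result = analyze(cfg, "A", frozenset(),
                 lambda i, s: s | {i},        # transfer: accumulate
                 lambda a, b: a | b,          # ⊔: set union
                 lambda a, b: a <= b)         # ⊑: set inclusion
```

Termination follows the same argument as in the paper: the abstract domain is finite and states only grow under ⊔.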

#### 3.2 Slot Analysis

The slot analysis aims at inferring the abstract slots, which are an abstraction of all memory allocations that will be made along the program execution. The inferred slots are abstract because the over-approximation is made at the level of the program points at which slots are allocated. Therefore, an abstract slot might represent multiple (not necessarily consecutive) real memory slots, e.g., when memory is allocated within a loop. The slot analysis looks for those program points at which the value stored in mem⟨0x40⟩ is read for reserving memory space. These program points are relevant in the analysis for two reasons: firstly, to obtain the baseref of the memory slot, and, secondly, because from this point on the memory reservation of the corresponding slot has started and is pending to become permanent at some subsequent program point. The output of the slot analysis is a set which contains the allocated abstract slots, named S_all in Def. 2 below. Each allocated abstract slot (i.e., each element in S_all) is in turn a set of program points, as the same abstract slot might have several program points where mem⟨0x40⟩ is read before its reservation becomes permanent. In order to obtain S_all, the memory analysis makes a flow-sensitive traversal of the (context-sensitive) CFG of the program that keeps at every program point the set of transient slots (i.e., those whose baseref has been read but has not yet been made permanent) and applies the transfer function in Def. 1 to each bytecode instruction within the blocks until a fixpoint is reached. An abstract state of the analysis is a set S ⊆ ℘(P_R), where P_R is the set of all program points at which mem⟨0x40⟩ is read. The analysis of the program starts with S = {∅} at all program points and takes ⊔ and ⊑ as the set union and inclusion operations. Termination is trivially guaranteed as the number of program points is finite and so is ℘(P_R).
In what follows, Ins is the set of EVM instructions and, for simplicity, we consider MLOAD 0x40 and MSTORE 0x40 as single instructions in Ins.

Definition 1 (slot analysis transfer function). Given a program point pp

with an instruction I ∈ Ins, an abstract state S, and K = {MSTORE 0x40, RETURN, REVERT, STOP, SELFDESTRUCT}, the slot analysis transfer function ν is defined as a mapping ν : Ins × ℘(S) → ℘(S) computed according to the following table:


Let us explain intuitively how the above transfer function works. As we have seen in Sec. 2, in an EVM program all memory reservations start by reading mem⟨0x40⟩ by means of an MLOAD instruction preceded by a PUSH 0x40 instruction (case 1 in Def. 1). In this case, the transfer function adds the current program point to all sets in S, since this is, in principle, an access to the same slots that were already open at this program point and are not permanent yet. To properly identify the slots, our analysis also searches for those program points at which slot reservations are made permanent (case 2 in Def. 1), i.e., those program points with instructions I ∈ K. The most frequently used instruction to make a slot reservation permanent is a write access to mem⟨0x40⟩ using MSTORE, which pushes forward the free memory pointer such that any subsequent read access to mem⟨0x40⟩ will allocate a different slot. The rest of the instructions in K finalize the execution in different forms (a normal return, a forced stop, a reverted execution, etc.). In all such cases, the slot needs to be considered permanent so that we can later reason on potential needless write accesses involving it. The set S is empty after these instructions since all transient (abstract) slots are made permanent by them. We use the notation S_pp to refer to the abstract state computed at program point pp.
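The two cases just described can be sketched in a few lines. The string encoding of instructions and the explicit `S_all` accumulator are our simplification of Def. 1:

```python
K = {"MSTORE 0x40", "RETURN", "REVERT", "STOP", "SELFDESTRUCT"}

def nu(pp, instr, S, S_all):
    """Slot-analysis transfer function (sketch). S is a set of sets
    of program points (the transient slots); S_all collects the slots
    that have become permanent."""
    if instr == "MLOAD 0x40":
        # case 1: a read of mem<0x40> extends every open transient slot
        return {frozenset(s | {pp}) for s in S}
    if instr in K:
        # case 2: all transient slots become permanent; S is reset to {∅}
        S_all |= {s for s in S if s}
        return {frozenset()}
    return S  # any other instruction leaves S unchanged

# replaying part of Example 5 on the block that reserves s3:
S, S_all = {frozenset()}, set()
S = nu("0x178", "MLOAD 0x40", S, S_all)   # S = {{0x178}}
S = nu("0x17F", "MSTORE 0x40", S, S_all)  # S = {∅}, S_all = {{0x178}}
```

We use `frozenset` only because Python sets cannot contain mutable sets; the abstract domain is the same ℘(℘(P_R)) as in the text.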

Example 5 (slot analysis). The slot analysis of Running2 starts with S_pp = {∅} at all program points. When it reaches the block that starts at 0x175 (see Ex. 2), S_0x175 is {∅} and it remains so until 0x178, where the baseref of s3 is read and hence S_0x178 = {{0x178}}. This slot is made permanent when the free memory pointer is updated at 0x17F, thus having S_0x17D = {{0x178}} and S_0x17F = {∅}. Following the same pattern, s4 and s6 are resp. reserved at instructions 0x178\_0 and 0x178\_1 and closed at 0x17F\_0 and 0x17F\_1 (in the cloned blocks). On the other hand, the baseref of s5 is read at two consecutive program points (0x4D and 0x5A) and updated at 0x5F; thus, we have S_0x4D = {{0x4D}}, the same until S_0x5A = {{0x4D, 0x5A}}, and again the same until S_0x5F = {∅}. Finally, after the execution of STATICCALL (see Ex. 3) we have three consecutive reads of mem⟨0x40⟩ at 0x114, 0x132 and 0x151 that refer to the same slot s7, which is made permanent at 0x160. Therefore, we have S_0x151 = {{0x114, 0x132, 0x151}} and S_0x160 = {∅}.

Using the transfer function, as mentioned in Sec. 3.1, our analysis makes a flow-sensitive traversal of the (context-sensitive) CFG of the program that uses as input for analyzing each block the information inferred for its callers. When a fixpoint is reached, we have an abstract state for each program point that we use to compute the set of abstract slots allocated in the program, named S_all.

Definition 2. The set of allocated abstract slots S_all is defined as S_all = ⋃_{pp ∈ P_W} S_{pp−1}, where P_W is the set of all program points pp:I where I ∈ K.

Example 6 (S_all computation). With the values of S_{0x17F−1}, S_{0x17F\_0−1}, S_{0x17F\_1−1}, S_{0x160−1} and S_{0x5F−1} from Ex. 5, at the end of the slot analysis of Running2 we have: S_all = { {0x178} (s3), {0x178\_0} (s4), {0x178\_1} (s6), {0x114, 0x132, 0x151} (s7), {0x5A, 0x4D} (s5), ... }. Note that the cloning of block 0x175 allows our analysis to detect three different slots, s3, s4 and s6, for the same program point, 0x178, in the original EVM code.

The next example shows the behavior of the analysis when the program contains loops, and an abstraction is needed for approximating the slots.

Example 7 (loops). Fig. 2 shows the contract Running3 that includes the function explicitOwnershipsOf from the smart contract at [2]. This function receives an array of token identifiers as argument and returns an array of TokenOwnership structs that is populated by invoking the function explicitOwnershipOf from Running2 (made through a STATICCALL) inside a loop. The slots identified by the analysis for contract Running3 shown in Fig. 2 are: s9, which is created for making a copy of parameter tokenIds to memory; s10, which creates the local array ownerships (L44) that contains the array length and pointers to the structs identified initially by s11 (and later on by s13); s12, for the STATICCALL input arguments and return data (L46); s13, which abstracts the structs for storing the STATICCALL output results (L46); and s14, which includes the length of ownerships

Fig. 2: Solidity code of contract Caller.

and a copy of s13 for returning the results (L48). The important point is that the local array declaration at L44 produces a loop to allocate as many structs as elements are contained in the array. For this reason, s11 is an abstract slot that represents all TokenOwnership's initially added to the array. Similarly, s12 and s13 are created inside the for loop, and each abstract slot represents as many concrete slots as iterations are performed by the loop. Note that each iteration of the loop creates one instance of s12 for getting the results from the call, which is later copied to s13 and pointed to by ownerships (s10).

As notation, we will use a unique numeric identifier (1, 2, ...) to refer to each abstract slot (represented in S_all as a set) and retrieve it by means of function get\_id(a), a ∈ S_all. We use A to refer to the set of all such identifiers in the program. Also, given a program point pp with an instruction MLOAD 0x40, we define the function get\_slots(pp) to retrieve the identifiers of the elements of S_all that might be referenced at pp as follows: get\_slots(pp) = {id | a ∈ S_all ∧ pp ∈ a ∧ id = get\_id(a)}.

#### 3.3 Slot Access Analysis

While Sec. 3.2 looked for allocations, the next step of the analysis is the inference of the program points at which the inferred abstract slots might be accessed. To do so, our slot access analysis needs to propagate the references to the abstract slots that are saved at the different positions of the execution stack. Importantly, in order to abstract complex data structures stored in memory (e.g., arrays of structs), we keep track not only of the stack positions but also of the abstract slots that could be saved at memory locations. As seen in Ex. 7, a memory location within a slot might contain a pointer to another memory location of another slot, as happens when nested data structures are used. Thus, an abstract state is a mapping in which we store the potential slots saved at stack positions or at memory locations within other slots.

Definition 3 (memory analysis abstract state). A memory analysis abstract state is a mapping π of the form T ∪ A ↦ ℘(A).

T is the set containing all stack positions, which we represent by natural numbers from 0 (bottom of the stack) on, and A is the set of abstract slot identifiers computed in Sec. 3.2. We refer to the set of all memory analysis abstract states as AS. Note that we keep a set of potential slots for each stack position because a block might be reached from several blocks with different execution stacks, e.g., in loops or if-then-else structures. In what follows, we assume that, given a value k, the map π returns the empty set when k ∉ dom(π). The inference is performed by a flow-sensitive analysis (as described in Sec. 3.1) that keeps track of the information about the abstract slots used at any program point by means of the following transfer function.

Definition 4 (memory analysis transfer function). Given an instruction I with n input operands at program point pp and an abstract state π, the memory analysis transfer function τ is defined as a mapping τ : Ins × AS → AS of the form:


t=top(pp) is the numerical position of the top of the stack before executing I.

Let us explain the above definition. The transfer function distinguishes between two different types of MLOAD: (1) accesses to location mem⟨0x40⟩, which return the baseref of the slots that might be used, taking them from the previous analysis through get\_slots(pp); and (2) other MLOAD instructions, which could potentially return slot baserefs from memory locations. Here we have to consider two possibilities: if we are reading a memory location that holds a generic value (e.g., a number) then π(t) = ∅; if we are reading a memory location that might store an abstract slot, then π(t) contains all abstract slots that might be stored at that memory location. Regarding (3), MSTORE has two operands: the operand at t is the memory address that will be modified by MSTORE, and the operand at t − 1 is the value to be stored at that address. For each element s in π(t), the analysis adds to π(s) the abstract slots that are in π(t−1). Other instructions that are also treated by the analysis are SWAP\* and DUP\*, shown in (4-5), which exchange or copy the elements of the stack that take part in the operation. Finally, all other operations delete the elements of the stack that are no longer used, based on the number of elements taken from and written to the stack (case 6).
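The MLOAD/MSTORE cases of τ can be sketched as follows. The string encoding of instructions is ours, the SWAP\*/DUP\* cases are not spelled out, and the final catch-all case is simplified to popping the top entry:

```python
def tau(pp, instr, pi, t, get_slots):
    """Sketch of the slot-usage transfer function τ. `pi` maps stack
    positions and slot ids to sets of slot ids; `t` is the stack top
    before executing `instr` (illustrative encoding, not the tool's)."""
    pi = {k: set(v) for k, v in pi.items()}  # work on a copy
    if instr == "MLOAD 0x40":
        # (1) pushes the baseref(s) inferred by the slot analysis
        pi[t] = set(get_slots(pp))
    elif instr == "MLOAD":
        # (2) may read a baseref stored inside another slot:
        # new π(t) is the union of π(s) for every slot s in π(t)
        pi[t] = set().union(*(pi.get(s, set()) for s in pi.get(t, set())))
    elif instr == "MSTORE":
        # (3) the value at t-1 may be stored inside the slots at t
        for s in pi.get(t, set()):
            pi[s] = pi.get(s, set()) | pi.get(t - 1, set())
        pi.pop(t, None)
        pi.pop(t - 1, None)
    else:
        # (6) simplified: drop the consumed top-of-stack entry
        # (SWAP*/DUP* would instead permute or copy entries, cases 4-5)
        pi.pop(t, None)
    return pi
```

For instance, an MLOAD on a stack position holding slot 10, where slot 10 may contain slot 11, replaces the entry with {11}, mirroring the nested-structure handling of Ex. 10.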

Example 8 (transfer). We now focus on the analysis of block 0x175, shown in Fig. 3. As we have already explained, this block is responsible for creating the memory needed to work with several structs of type TokenOwnership and it is thus cloned in the CFG. In particular, we focus on the clone 0x175\_1. The analysis of the block starts with a stack of size 7 that includes, at positions 3 and 4, the abstract slots s3 and s4, which were created at L26 and L27 of Fig. 1. At 0x178\_1, mem⟨0x40⟩ is read and, by means of get\_slots(0x178\_1) and considering that top(0x178\_1) = 8, we add to π a new entry 8 ↦ s6. At 0x179\_1, 0x180\_1, 0x1AA\_1 and 0x1B3\_1 the transfer function duplicates a slot identifier stored in the stack. The MSTORE and POP instructions of the example remove a slot identifier from the stack.


Fig. 3: Block of the CFG that reserves memory slot for struct

As it is flow-sensitive, the analysis of each block of the CFG takes as input the join ⊔ of the abstract states computed with the transfer function for the blocks that jump to it, and keeps applying the memory analysis transfer function until a fixpoint is reached. The operation A ⊔ B is the result of joining, by means of operation ∪, all entries from maps A and B. Operation ⊑ is defined as expected: A ⊑ B when every entry v ∈ dom(A) satisfies A(v) ⊆ B(v). Again, termination of the computation is guaranteed because the domain is finite.
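One direct way (ours, for illustration) to realize ⊔ and ⊑ on such maps-to-sets is pointwise union and pointwise inclusion, treating missing keys as the empty set:

```python
def join(a: dict, b: dict) -> dict:
    """⊔: pointwise union of two abstract states (maps to sets)."""
    return {k: a.get(k, set()) | b.get(k, set()) for k in a.keys() | b.keys()}

def leq(a: dict, b: dict) -> bool:
    """⊑: every entry of `a` is subsumed by the corresponding entry of `b`."""
    return all(v <= b.get(k, set()) for k, v in a.items())

# Example 9 as data: joining the two paths reaching the return block
path_l29 = {3: {"s8"}}
path_l33 = {3: {"s4"}}
merged = join(path_l29, path_l33)  # {3: {"s4", "s8"}}
```

Since both maps have finitely many keys over a finite slot set A, chains under ⊑ are finite and the fixpoint computation terminates.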

Example 9 (joining abstract states). The EVM code of explicitOwnershipOf of Fig. 1 uses s5 in both return sentences at L29 and L33 (see Ex. 1). This EVM code has a single return block, which is reachable along two different paths from the if statement that come with different abstract states: the path that corresponds to L29 comes with π = {3 ↦ s8}, and the other path (L33) with π = {3 ↦ s4}. Our analysis joins both abstract states, resulting in π = {3 ↦ {s4, s8}}. Because of this join, we get that the RETURN instruction reached from lines L29 and L33 might return the content of the slots s4 or s8.

When the fixpoint is reached, the analysis has computed an abstract state for each program point pp, denoted by π_pp in what follows.

Example 10 (complex data structures). The analysis of the code in Fig. 2 shows how it deals with data structures that might contain pointers to other structures, e.g., ownerships. The abstract slot that represents variable ownerships is s10, which is written by means of MSTORE at two program points, say pp1 and pp2, which resp. come from L44 and L46 of the Solidity code. The input abstract state that reaches pp1 is {2 ↦ s9, 6 ↦ s10, 8 ↦ s10, 9 ↦ s11, 10 ↦ s10}, and the transfer function of MSTORE leaves the abstract state as π_pp1 = {2 ↦ s9, 6 ↦ s10, 8 ↦ s10, s10 ↦ s11}. At this point, we can see that variable ownerships is initialized with empty structs and, to represent it, our analysis includes in π the entry s10 ↦ s11, as described for instruction MSTORE in the transfer function of Def. 4. The second write to s10 is performed by another MSTORE instruction at pp2. The input abstract state for pp2 is {2 ↦ s9, 5 ↦ s10, 7 ↦ s13, 8 ↦ s13, 9 ↦ s10, s10 ↦ s11}, and thus we get π_pp2 = {2 ↦ s9, 5 ↦ s10, 7 ↦ s13, s10 ↦ {s11, s13}}. Interestingly, at pp2 we detect that s11 might also store the structs returned by the call to c.explicitOwnershipOf(tokenIds[i]), identified by s13, which is added to s10 ↦ {s11, s13}. Finally, s10 is read at the end of the method, returning the set {s11, s13}, to copy the content of ownerships to s14, the slot used in the return.

## 3.4 Inference of Needless Write Memory Accesses

With the results of the previous analysis, we can compute the maps R and W, which are of the form pp ↦ ℘(A) and capture the slots that might be read or written, resp., at the different program points. To do so, as multiple EVM instructions, e.g., RETURN, CALL, LOG, CREATE, ..., might read or write memory locations taking the concrete location from the stack, we define functions mr(I) and mw(I) that, given an EVM instruction I, return the position in the stack of the address to be read and written by I, resp. If the instruction does not read/write any memory position, then mr(I) = ⊥ / mw(I) = ⊥. For example, mr(MLOAD) = 0, as it reads the top of the stack, and mw(MLOAD) = ⊥; or mr(STATICCALL) = 2 and mw(STATICCALL) = 4. Now, we define the read/write maps R/W:

Definition 5 (memory read/write accesses map). Given an EVM program P, such that pp ≡ I ∈ P and being t=top(pp), we define maps R and W as follows:

$$\mathcal{R}(pp) = \begin{cases} \emptyset & mr(I) = \bot \\ \pi_{pp-1}(t - mr(I)) & \text{otherwise} \end{cases} \qquad \mathcal{W}(pp) = \begin{cases} \emptyset & mw(I) = \bot \\ \pi_{pp-1}(t - mw(I)) & \text{otherwise} \end{cases}$$

Example 11 (R/W maps). Let us illustrate the computation of R(0x139) and W(0x139), where 0x139 contains the STATICCALL of Running2. With the information obtained from the analysis we have that top(0x139) = 16 and π_0x138 = {3 ↦ s3, 4 ↦ s4, 7 ↦ s6, 10 ↦ s7, 12 ↦ s7, 14 ↦ s7}; thus we get R(0x139) = {s7} and W(0x139) = {s7}, i.e., the slot used for managing the input and the output of the external call. Analogously, we get that R(0x178) = {s3} and W(0x178) = ∅.
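Def. 5 is essentially a table lookup followed by an indexing of π at the right stack position. Below, `None` plays the role of ⊥ and the (partial) mr/mw table is ours; the MLOAD and STATICCALL entries follow the text:

```python
# stack offsets (from the top) of the memory address read/written
# by each instruction; None encodes ⊥ (partial illustrative table)
MR = {"MLOAD": 0, "STATICCALL": 2}
MW = {"MLOAD": None, "STATICCALL": 4}

def access_map(instr, pi_prev, t, table):
    """Compute R(pp) or W(pp) from the abstract state π at pp-1,
    where t = top(pp) and `table` is MR or MW."""
    off = table.get(instr)
    return set() if off is None else pi_prev.get(t - off, set())

# replaying Example 11: the STATICCALL at 0x139 with top(0x139) = 16
pi_0x138 = {10: {"s7"}, 12: {"s7"}, 14: {"s7"}}
r = access_map("STATICCALL", pi_0x138, 16, MR)  # π(16-2) = {s7}
w = access_map("STATICCALL", pi_0x138, 16, MW)  # π(16-4) = {s7}
```

Both lookups hit entries bound to s7, matching the observation that the call's input and output share one slot.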

The last step of our analysis consists in searching for write accesses to slots which will never be read later. To do so, we use the information computed in R and W. Given the CFG of the program and two program points p1 and p2, we define the function reachable(p1, p2), which returns true when there exists a path in the CFG from p1 to p2. We define the set of write leaks N as follows:

Definition 6. Given an EVM program and its W and R, we define N as N = {pw:s | pw ∈ P ∧ s ∈ W(pw) ∧ ¬exists\_read(pw, s)}, where exists\_read(pw, s) ≡ ∃ pr ∈ dom(R) . s ∈ R(pr) ∧ reachable(pw, pr).

Intuitively, the set N contains those write accesses, taken from W, that are never read by subsequent blocks in the CFG. As both function reachable and the sets W and R are over-approximations, the computation of N provides us with write accesses that can be safely removed, as the next example shows.
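Def. 6 can be realized directly once reachability over the CFG is available; a simple DFS is one possible (and our illustrative) choice:

```python
def reachable_from(cfg_succ, src):
    """Program points reachable from src in the CFG (iterative DFS);
    cfg_succ maps each program point to its successors."""
    seen, stack = set(), [src]
    while stack:
        p = stack.pop()
        for q in cfg_succ.get(p, ()):
            if q not in seen:
                seen.add(q)
                stack.append(q)
    return seen

def needless_writes(W, R, cfg_succ):
    """N: pairs (pw, s) where slot s is written at pw but never read
    at any program point reachable from pw."""
    N = set()
    for pw, slots in W.items():
        reach = reachable_from(cfg_succ, pw)
        for s in slots:
            if not any(s in R.get(pr, set()) for pr in reach):
                N.add((pw, s))
    return N

# tiny CFG: 1 -> 2 -> 3; s3 written at 1 is never read, s7 is read at 3
N = needless_writes({1: {"s3"}, 2: {"s7"}}, {3: {"s7"}}, {1: [2], 2: [3]})
```

Because W and R over-approximate the real accesses, a pair landing in N is guaranteed never to be read, so removing it is safe.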

Example 12. Our analysis detects that at program points 0x19A, 0x1AB and 0x1B4 there are MSTORE operations whose values are never read in the subsequent blocks of the CFG. Such operations correspond to the memory initialization of s3, which is performed at L27 of the code of Fig. 1 (see Ex. 2). Given that these write accesses are the only use of the slot, the whole reservation can be safely removed. Moreover, the MSTORE operations at program points 0x19A\_1, 0x1AB\_1 and 0x1B4\_1, which correspond to the reservation of s6 performed at L22, are also detected as needless. In essence, this means that s3 and s6 are allocated and initialized but never used in the program. Note that all these program points belong to two cloned blocks (0x175 and 0x175\_1). However, the three MSTORE operations of the other clone of the same block (0x175\_0), which correspond to the allocation at L28, are not identified as non-read, as they might be used in the return of the function. Here, the precision of the context-sensitive CFG is necessary to identify the former MSTORE operations as needless. As a result, we cannot eliminate the block itself, because it is needed in one of the clones, but we can still achieve an important optimization on the EVM code by removing the unconditional jumps to this block in the other two cases, which would avoid completely the execution of all these instructions (and their corresponding gas consumption [27]).

The soundness of the slot and slot access analyses states that, for each concrete slot, there exists an abstract slot in S_all that represents it, and that any access to memory is approximated by an inferred abstract slot. Technical details can be found in an extended report [8].

## 4 Experimental Evaluation

This section reports on the results of the experimental evaluation of our approach, as described in Sec. 3. All components of the analysis are implemented in Python, are open-source, and can be downloaded from GitHub, where detailed instructions for their installation and usage are provided<sup>4</sup>. We use external components to build the CFGs (as this is not a contribution of our work). Our analysis tool accepts smart contracts written in versions of Solidity up to 0.8.17 and bytecode for the Ethereum Virtual Machine v1.10.25<sup>5</sup>. The experiments have been performed on an AMD Ryzen Threadripper PRO 3995WX 64-core machine with 512 GB of memory, running Debian 5.10.70. In order to experimentally evaluate the analysis, we pulled from etherscan.io [5] the Ethereum contracts bound to the last 5,000 open-source verified addresses whose source code was available on July 14, 2022. From those addresses, the code of 2.18% of them raises a compilation error from solc. For the code bound to the 4,891 remaining addresses, the generation of the CFG (which is not a contribution of this work) times out after 120s on 626 of them. Removing such failing cases, we have finally analyzed 19,199 smart contracts, as each address and each Solidity file may contain several contracts. Note that 84.86% of the contracts are compiled with solc version 0.8, presumably with the most advanced compilation techniques. The whole dataset used can be found at the above GitHub link.

In order to be in a worst-case scenario for us, we run the memory analysis after executing the solc optimizer, i.e., we analyze bytecode whose memory usage may have already been optimized by the optimizer available in solc. This also allows us to see whether we can achieve further optimization with our

<sup>4</sup> https://github.com/costa-group/EthIR/tree/memory optimizer/ethir

<sup>5</sup> The latest versions released up to Oct 2022.

approach. Unfortunately, we have not been able to apply our tool after running the super-optimizer GASOL [9], because it does not generate the optimized bytecode; rather, it only reports on the gas and/or size gains for each of the blocks. Nevertheless, a detailed comparison of the techniques that GASOL applies and ours is given in Sec. 5, where we justify that GASOL would not find any of our needless accesses. From the 19,199 analyzed contracts, the analysis infers 679,517 abstract memory slots and detects 6,242 needless write memory accesses in 12,803s. These needless accesses occur within the code bound to 780 different addresses, i.e., 15.95% of the analyzed ones.

We have computed the number of needless accesses identified by our analysis, grouped by function, together with the number of different contracts that contain these functions. Some of them, such as transferFrom (1,736 accesses in 439 contracts), transfer (1,745 accesses in 441 contracts), reflectionFromToken (105 accesses in 6 contracts) or withdraw (54 accesses in 32 contracts), are functions widely used in the implementation of contracts based on ERC tokens. A manual inspection of the 10 most common public functions with needless accesses has revealed two different sources for them: some of the needless accesses are due to inefficient programming practices, while others are generated by the compiler and could be improved. As regards compiler inefficiencies, we detected bytecode that allocates memory slots that are inaccessible and cannot be used, because the baseref to access them is not kept on the stack. For example, when a struct is returned by a function, memory is always allocated for this data; if the return variable is not named in the header of the function, the compiler still allocates this memory although it will never be accessed. Programmers aware of this behavior can avoid such generation of useless memory but, even better, these memory usage patterns can be changed in the compiler. This situation arises at L22 and L27 in Fig. 1, where the functions do not name the return variable; hence, the compiler allocates memory for these anonymous data structures, which are never used. Similarly, there are various situations involving external calls in which the compiler creates memory that is never used. When there is an external call that does not retrieve any result, the compiler creates two memory slots: one for retrieving the result from the call, and another for copying a potential result to a memory variable that is never used.
Finally, the compiler also creates never-used memory for low-level plain calls that transfer currency. Even though the contract code does not use the second result returned by the low-level call, the compiler generates code for retrieving it. All these potential optimizations have been detected by means of our inference of needless write accesses and will be communicated to the solc developers.

## 5 Conclusions and Related Work

We have proposed a novel memory analysis for Ethereum smart contracts and have applied it to infer needless write memory accesses. The application of our implementation to more than 19,000 real smart contracts has detected some compilation patterns that introduce needless write accesses and that can be easily changed in the compiler to generate more efficient code. Let us discuss related work along two directions: (1) memory analysis and (2) memory optimization. Regarding (1), we can find advanced points-to analyses developed for Java-like languages [7, 11, 18, 20]. Focusing on EVM, the static modeling of the EVM memory in [16] has some similarities with the memory analysis presented in Secs. 3.2 and 3.3, since in both cases the aim is to model the memory, although with different applications in mind. On one hand, there are differences in the type of static analysis used: [16] is based on a Datalog analysis, while we have defined a standard transfer function used within a flow-sensitive analysis. More importantly, there are differences in the precision of both analyses. We can accurately model the memory allocated by nested data structures in which the memory contains pointers to other memory slots, while [16] does not capture such types of accesses. This is fundamental to perform memory optimization since, as shown in the running examples of the paper, it allows detecting needless write accesses that would otherwise be missed. Finally, the application of the memory analysis to optimization is not studied in [16], while it is the main focus of our work.

As regards (2), optimizing memory usage is a challenging research problem that requires precisely inferring the memory positions being accessed. Such positions are sometimes statically known (e.g., when accessing the EVM free memory pointer) but, as we have seen, often a precise and complex inference is required to figure out the slot being accessed at each memory access bytecode. Recent work within the super-optimizer GASOL [9] is able to perform some memory optimizations at the level of each block of the CFG (i.e., intra-block). There are two fundamental differences between our work and GASOL: First, GASOL can only apply the optimizations when the memory locations being addressed refer to the same constant address. In other words, there is no real memory analysis (namely Secs. 3.2 and 3.3). Second, the optimizations are applied only at an intra-block level and hence many optimization opportunities are missed. These two points make a fundamental difference with our approach, since the detected optimizable patterns (see Sec. 4) require inter-block analysis and a precise slot access analysis, and hence cannot be detected by GASOL.

Finally, as mentioned in Sec. 1, in addition to dynamic memory, smart contracts also use a persistent memory called storage. Regarding the application of our approach to infer needless accesses in storage, there are two main points. First, there is no need to develop a static analysis to detect the slots in storage, as they are statically known (hence our inference in Secs. 3.2 and 3.3 is not needed): the read and write sets of Def. 6 can be easily defined for storage. The second point is that, as storage is persistent memory, a storage write access is not removable even if there is no further read access within the smart contract, as the value needs to be stored for a future transaction. The only removable storage write accesses are those that are overwritten without being read in between the two writes. Including this in our implementation is straightforward. However, this situation is rather unusual, and we believe that very few cases would be found and hence little optimization can be achieved.
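The write-overwritten-without-intervening-read pattern just described can be checked on a per-slot access trace with a short sketch. This is illustrative only (not part of our tool), and the trace format is a hypothetical simplification:

```python
# Illustrative sketch (not part of the paper's tool): given a trace of
# storage accesses, a write is removable iff it is overwritten by a
# later write with no read of the same slot in between.
def removable_writes(trace):
    """trace: list of (op, slot) pairs with op in {'read', 'write'}.
    Returns the indices of write accesses that are overwritten before
    any intervening read of the same slot."""
    removable = []
    last_write = {}  # slot -> index of the latest not-yet-read write
    for i, (op, slot) in enumerate(trace):
        if op == 'read':
            last_write.pop(slot, None)   # the latest write is observed
        else:  # 'write'
            if slot in last_write:       # previous write was never read
                removable.append(last_write[slot])
            last_write[slot] = i
    return removable

trace = [('write', 's1'), ('write', 's1'), ('read', 's1'), ('write', 's1')]
print(removable_writes(trace))  # -> [0]
```

The final write of each slot is kept, since storage persists across transactions and may be read later.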

## References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/ 4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Markov Chains/Stochastic Control**

## A Practitioner's Guide to MDP Model Checking Algorithms?

Arnd Hartmanns<sup>1</sup> , Sebastian Junges<sup>2</sup> , Tim Quatmann<sup>3</sup> , and Maximilian Weininger<sup>4</sup>()

<sup>1</sup> University of Twente, Enschede, The Netherlands a.hartmanns@utwente.nl

<sup>2</sup> Radboud University, Nijmegen, The Netherlands sebastian.junges@ru.nl

<sup>3</sup> RWTH Aachen University, Aachen, Germany tim.quatmann@cs.rwth-aachen.de

<sup>4</sup> Technical University of Munich, Munich, Germany maxi.weininger@tum.de

Abstract. Model checking undiscounted reachability and expected-reward properties on Markov decision processes (MDPs) is key for the verification of systems that act under uncertainty. Popular algorithms are policy iteration and variants of value iteration; in tool competitions, most participants rely on the latter. These algorithms generally need worst-case exponential time. However, the problem can equally be formulated as a linear program, solvable in polynomial time. In this paper, we give a detailed overview of today's state-of-the-art algorithms for MDP model checking with a focus on performance and correctness. We highlight their fundamental differences, and describe various optimizations and implementation variants. We experimentally compare floating-point and exact-arithmetic implementations of all algorithms on three benchmark sets using two probabilistic model checkers. Our results show that (optimistic) value iteration is a sensible default, but other algorithms are preferable in specific settings. This paper thereby provides a guide for MDP verification practitioners—tool builders and users alike.

## 1 Introduction

The verification of MDPs is crucial for the design and evaluation of cyber-physical systems with sensor noise, biological and chemical processes, network protocols, and many other complex systems. MDPs are the standard model for sequential decision making under uncertainty and thus at the heart of reinforcement learning. Many dependability evaluation and safety assurance approaches rely in some form on the verification of MDPs with respect to temporal logic properties. Probabilistic model checking [4,5] provides powerful tools to support this task.

The essential MDP model checking queries are for the worst-case probability that something bad happens (reachability) and the expected resource consumption until task completion (expected rewards). These are indefinite (undiscounted)

© The Author(s) 2023

<sup>?</sup> This research was funded by the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 101008233 (MISSION), and by NWO VENI grant no. 639.021.754.

S. Sankaranarayanan and N. Sharygina (Eds.): TACAS 2023, LNCS 13993, pp. 469–488, 2023. https://doi.org/10.1007/978-3-031-30823-9\_24

horizon queries: They ask about the probability or expectation of a random variable up until an event—which forms the horizon—but are themselves unbounded. Many more complex properties internally reduce to solving either reachability or expected rewards. For example, if the description of something bad is in linear temporal logic (LTL), then a product construction with a suitable automaton reduces the LTL query to reachability [6]. This paper sets out to determine the practically best algorithms to solve indefinite horizon reachability probabilities and expected rewards; our methodology is an empirical evaluation.

MDP analysis is well studied in many fields and has led to three main types of algorithms: value iteration (VI), policy iteration (PI), and linear programming (LP) [55]. While indefinite horizon queries are natural in a verification context, they differ from the standard problem of, e.g., operations research, planning, and reinforcement learning. In those fields, the primary concern is to compute a policy that (often approximately) optimizes the discounted expected reward over an infinite horizon, where rewards accumulated in the future are weighted by a discount factor < 1 that exponentially prefers values accumulated earlier.

The lack of discounting in verification has vast implications. The Bellman operation, essentially describing a one-step backward update on expected rewards, is a contraction with discounting, but not a contraction without. This leads to significantly more complex termination criteria for VI-based verification approaches [34]. Indeed, VI runs in polynomial time for every fixed discount factor [49], and similar results are known for PI as well as LP solving with the simplex algorithm [60]. In contrast, VI [9] and PI [20] are known to have exponential worst-case behaviour in the undiscounted case.

So, what is the best algorithm for model checking MDPs? A polynomial-time algorithm exists using an LP formulation and barrier methods for its solution [12]. LP-based approaches (and their extension to MILPs) are also prominent for multi-objective model checking [21], in counterexample generation [23], and for the analysis of parametric Markov chains [16]. However, folklore tells us that iterative methods, in particular VI, are better for solving MDPs. Indeed, variations of VI are the default choice of all model checkers participating in the QComp competition [14]. This uniformity may be misleading. Indeed, for some stochastic game algorithms, using LP to solve the underlying MDPs may be preferential [3, Appendix E.4]. An application in runtime assurance preferred PI for numerical stability [45, Sect. 6]. A toy example from [34] is a famous challenge for VI-based methods. Despite the prominence of LP, the ease of encoding MDPs, and the availability of powerful off-the-shelf LP solvers, until very recently many tools did not include MDP model checking via LP solvers.

With this paper, we reconsider the PI and LP algorithms to investigate whether probabilistic model checking has focused on the wrong family of algorithms. We report the results of an extensive empirical study with two independent implementations in the model checkers Storm [42] and mcsta [37]. We find that, in terms of performance and scalability, optimistic value iteration [40] is a solid choice on the standard benchmark collection (which goes beyond competition benchmarks) but can be beaten quite considerably on challenging cases. We also

emphasize the question of precision and soundness. Numerical algorithms, in particular ones that converge in the limit, are prone to delivering wrong results. For VI, the recognition of this problem has led to a series of improvements over the last decade [8,34,40,19,54,56]. We show that PI faces a similar problem. When using floating-point arithmetic, additional issues may arise [36,59]. Our use of various LP solvers exhibits concerning results for a variety of benchmarks. We therefore also include results for exact computation using rational arithmetic.

Limitations of this study. A thorough experimental study of algorithms requires a carefully scoped evaluation. We work with flat representations of MDPs that fit completely into memory (i.e. we ignore the state space exploration process and symbolic methods). We selected algorithms that are tailored to converge to the optimal value. We also exclude approaches that incrementally build and solve (partial or abstract) MDPs using simulation or model checking results to guide exploration: they are an orthogonal improvement and would equally profit from faster algorithms to solve the partial MDPs. Moreover, this study is on algorithms, not on their implementations. To reduce the impact of potential implementation flaws, we use two independent tools where possible. Our experiments ran on a single type of machine—we do not study the effect of different hardware.

Contributions. This paper contributes a thorough overview of how to model check indefinite horizon properties on MDPs, making MDP model checking more accessible, but also pushing the state of the art by clarifying open questions. Our study is built upon a thorough empirical evaluation that uses two independent code bases, sources benchmarks from the standard benchmark suite and recent publications, compares 10 LP solvers, and studies the influence of various prominent preprocessing techniques. The paper provides new insights and reviews folklore statements: particular highlights are a new simple but challenging MDP family that leads to wrong results on all floating-point LP solvers (Section 2.3), a negative result regarding the soundness of PI with epsilon-precise policy evaluators (Section 4), and an evaluation on numerically challenging benchmarks that shows the limitations of value iteration in a practical setting (Section 5.3).

## 2 Background

We recall MDPs with reachability and reward objectives, describe solution algorithms and their guarantees, and address commonly used optimizations.

#### 2.1 Markov Decision Processes

Let D_X := { d: X → [0, 1] | Σ_{x∈X} d(x) = 1 } be the set of distributions over X. A Markov decision process (MDP) [55] is a tuple M = (S, A, δ) with finite sets of states S and actions A, and a partial transition function δ: S × A ⇀ D_S such that A(s) := { a | (s, a) ∈ domain(δ) } ≠ ∅ for all s ∈ S. A(s) is the set of enabled actions at state s. δ maps enabled state-action pairs to distributions over successor states. A Markov chain (MC) is an MDP with |A(s)| = 1 for all s. The semantics of an MDP are defined in the usual way, see, e.g., [6, Chapter 10]. A

(memoryless deterministic) policy—a.k.a. strategy or scheduler—is a function π: S → A that, intuitively, given the current state s prescribes what action a ∈ A(s) to play. Applying a policy π to an MDP induces an MC M^π. A path in this MC is an infinite sequence ρ = s_1 s_2 … with δ(s_i, π(s_i))(s_{i+1}) > 0. Paths denotes the set of all paths and P^π_s denotes the unique probability measure of M^π over infinite paths starting in the state s.

A reachability objective Popt(T) with set of target states T ⊆ S and opt ∈ {max, min} induces a random variable X: Paths → [0, 1] over paths by assigning 1 to all paths that eventually reach the target and 0 to all others. Eopt(rew) denotes an expected reward objective, where rew: S → Q_{≥0} assigns a reward to each state. rew(ρ) := Σ_{i=1}^{∞} rew(s_i) is the accumulated reward of a path ρ = s_1 s_2 …. This yields a random variable X: Paths → Q ∪ {∞} that maps paths to their reward. For a given objective and its random variable X, the value of a state s ∈ S is the expectation of X under the probability measure P^π_s of the MC induced by an optimal policy π from the set of all policies Π, formally V(s) := opt_{π∈Π} E^π_s[X].
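For concreteness, the MDP definition above can be encoded directly as a mapping from enabled state-action pairs to distributions. The following sketch (with a small hypothetical two-action MDP) checks the well-formedness conditions, namely that each δ(s, a) is a distribution and that A(s) ≠ ∅ for every state:

```python
from fractions import Fraction

# A minimal MDP encoding following the definition above: delta maps
# enabled (state, action) pairs to distributions over successor states.
# The three-state example below is hypothetical, for illustration only.
delta = {
    ('s0', 'a'): {'s1': Fraction(1, 2), 's2': Fraction(1, 2)},
    ('s0', 'b'): {'s2': Fraction(1)},
    ('s1', 'a'): {'s1': Fraction(1)},
    ('s2', 'a'): {'s2': Fraction(1)},
}

def enabled(s):
    """A(s): the set of actions enabled in state s (must be non-empty)."""
    return {a for (t, a) in delta if t == s}

# Every distribution must sum to 1, and A(s) must be non-empty.
states = {t for (t, _) in delta} | {u for d in delta.values() for u in d}
assert all(sum(d.values()) == 1 for d in delta.values())
assert all(enabled(s) for s in states)
print(sorted(enabled('s0')))  # -> ['a', 'b']
```

Exact rational probabilities (`Fraction`) avoid the floating-point pitfalls discussed in Section 2.3.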

## 2.2 Solution Algorithms

Value iteration (VI), e.g. [15], computes a sequence of value vectors converging to the optimum in the limit. In all variants of the algorithm, we start with a function x: S → Q that assigns to every state an estimate of the value. The algorithm repeatedly performs an update operation to improve the estimates. After some preprocessing, this update operation has a unique fixpoint, attained at x = V; thus, value iteration converges to the value in the limit. Variants of VI include interval iteration [34], sound VI [56] and optimistic VI [40]. We do not discuss these in detail, but instead refer to the respective papers.
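The iteration can be sketched in a few lines. This is a bare-bones illustration for Pmax(T) on a small hypothetical MDP, using a naive absolute-difference stopping criterion (which, as Section 2.3 explains, gives no formal precision guarantee in general):

```python
# Value iteration sketch for Pmax(T) on a hypothetical 3-state MDP.
delta = {
    ('s0', 'a'): {'goal': 0.5, 'sink': 0.5},
    ('s0', 'b'): {'s0': 0.9, 'goal': 0.1},
    ('goal', 'a'): {'goal': 1.0},
    ('sink', 'a'): {'sink': 1.0},
}
states = {'s0', 'goal', 'sink'}
target = {'goal'}

# start from the trivial underapproximation: 1 on targets, 0 elsewhere
x = {s: (1.0 if s in target else 0.0) for s in states}
while True:
    y = dict(x)
    for s in states - target:
        # Bellman backup: best one-step update over enabled actions
        y[s] = max(sum(p * x[t] for t, p in dist.items())
                   for (u, a), dist in delta.items() if u == s)
    done = max(abs(y[s] - x[s]) for s in states) < 1e-9
    x = y
    if done:
        break
print(round(x['s0'], 6))  # -> 1.0 (action b reaches goal almost surely)
```

Note how slowly the estimate for s0 creeps towards 1 under action b: this geometric convergence is exactly what makes naive stopping criteria unsound on harder instances.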

Linear programming (LP), e.g. [6, Chapter 10], encodes the transition structure of the MDP and the objective as a linear optimization problem. For every state, the LP has a variable representing an estimate of its value. Every state-action pair is encoded as a constraint on these variables, as are the target set or rewards. The unique optimum of the LP is attained if and only if for every state its corresponding variable is set to the value of the state. We provide an in-depth discussion of theoretical and practical aspects of LP in Section 3.

Policy iteration (PI), e.g. [11, Section 4], computes a sequence of policies. Starting with an initial policy, we evaluate its induced MC, improve the policy by switching suboptimal choices and repeat the process on the new policy. As every policy improves the previous one and there are only finitely many memoryless deterministic policies (a number exponential in the number of states), eventually we obtain an optimal policy. We further discuss PI in Section 4.
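The evaluate-improve loop can be sketched as follows, on a hypothetical MDP, with the induced MC evaluated exactly over rationals via Gaussian elimination (one of the exact solver options discussed in Section 4):

```python
from fractions import Fraction as F

# Policy iteration sketch with exact policy evaluation. States that
# cannot reach the target under the current policy get value 0
# (qualitative preprocessing, cf. Section 2.4). Hypothetical MDP.
delta = {
    ('s0', 'a'): {'goal': F(1, 2), 'sink': F(1, 2)},
    ('s0', 'b'): {'s0': F(9, 10), 'goal': F(1, 10)},
    ('goal', 'a'): {'goal': F(1)},
    ('sink', 'a'): {'sink': F(1)},
}
target = {'goal'}

def evaluate(policy):
    """Exact reachability values of the MC induced by `policy`."""
    can = set(target)            # states that can reach T under policy
    changed = True
    while changed:
        changed = False
        for (s, a), dist in delta.items():
            if policy.get(s) == a and s not in can and set(dist) & can:
                can.add(s); changed = True
    unknown = sorted(can - target)
    idx = {s: i for i, s in enumerate(unknown)}
    n = len(unknown)
    # build (I - P) v = b and solve by Gaussian elimination
    A = [[F(0)] * n for _ in range(n)]
    b = [F(0)] * n
    for s in unknown:
        A[idx[s]][idx[s]] = F(1)
        for t, p in delta[(s, policy[s])].items():
            if t in target:
                b[idx[s]] += p
            elif t in idx:
                A[idx[s]][idx[t]] -= p
    for i in range(n):
        piv = next(r for r in range(i, n) if A[r][i] != 0)
        A[i], A[piv] = A[piv], A[i]; b[i], b[piv] = b[piv], b[i]
        for r in range(n):
            if r != i and A[r][i] != 0:
                f = A[r][i] / A[i][i]
                A[r] = [u - f * w for u, w in zip(A[r], A[i])]
                b[r] -= f * b[i]
    v = {s: b[idx[s]] / A[idx[s]][idx[s]] for s in unknown}
    v.update({t: F(1) for t in target})
    return v

policy = {'s0': 'a', 'sink': 'a'}
while True:
    v = evaluate(policy)
    improved = dict(policy)
    for (s, a), dist in delta.items():
        if s not in target:
            val = sum(p * v.get(t, F(0)) for t, p in dist.items())
            if val > v.get(s, F(0)):       # switch suboptimal choice
                improved[s] = a; v[s] = val
    if improved == policy:
        break
    policy = improved
print(policy['s0'], v['s0'])  # -> b 1
```

With exact arithmetic the improvement test `val > v[s]` is reliable; Section 4 shows how an ε-precise evaluator can make this very comparison fail.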

#### 2.3 Guarantees

Given the stakes in many application domains, we require guarantees about the relation between an algorithm's result v̄ and the true value v. First, implementations are subject to floating-point errors and imprecision [59] unless they use exact (rational) arithmetic or safe rounding [36]. This can result in arbitrary


Table 1: Correct results

differences between v̄ and v. Second are the algorithm's inherent properties: VI is an approximating algorithm that converges to the true value only in the limit. In theory, it is possible to obtain the exact result by rounding after exponentially many iterations [15]; in practice, this results in excessive runtime. Instead, for years, implementations used a naive stopping criterion that could return arbitrarily wrong results [33]. This problem's discovery sparked the development of sound variants of VI [8,34,40,19,54,56], including interval iteration, sound value iteration, and optimistic value iteration. A sound VI algorithm guarantees ε-precise results, i.e. |v − v̄| ≤ ε or |v − v̄| ≤ v · ε. For LP and PI, the guarantees have not yet been thoroughly investigated. Theoretically, both are exact, but implementations are often not. We discuss the problems in Sections 3 and 4.

The handcrafted MC of [33, Figure 2] highlights the lack of guarantees of VI: standard implementations return vastly incorrect results. We extended it with action choices to obtain the MDP M_n shown in Fig. 1 for n ∈ N, n ≥ 2. It has 2n + 1 states; we compute Pmin({ n }) and Pmax({ n }). The policy that chooses action m wherever possible induces the MC of [33, Figure 2] with (Pmin({ n }), Pmax({ n })) = (1/2, 1/2). In every state s with 0 < s < n, we added the choice of action j that jumps to n and −n. With that, the (optimal) values over all policies are (1/3, 2/3). In VI, starting from value 0 for all states except n, initially taking j everywhere looks like the best policy for Pmax. As updated values slowly propagate, state-by-state, m becomes the optimal choice in all states except −n + 1. We thus layered a "deceptive" decision problem on top of the slow convergence of the original MC. For n = 20, VI with Storm and mcsta delivers the incorrect results (0.247, 0.500). For Storm's PI and various LP solvers, we show in Table 1 the largest n for which they return a ±0.01-correct result. For larger n, PI and all LP solvers claim ≈ (1/2, 1/2) as the correct solution, except for Glop and GLPK, which only fail for the maximum at the given n; for the minimum, they return the wrong result at n ≥ 29 and 52, respectively. Sound VI algorithms and Storm's exact-arithmetic engine produce (ε-)correct results, though the former at excessive runtime for larger n. We used default settings for all tools and solvers.

## 2.4 Optimizations

VI, LP, and PI can all benefit from the following optimizations:

Graph-theoretic algorithms can be used for qualitative analysis of the MDP, i.e. finding states with value 0 or (only for reachability objectives) 1. These qualitative approaches are typically a lot faster than the numerical computations for quantitative analysis. Thus, we always apply them first and only run the numerical algorithms on the remaining states with non-trivial values.
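For Pmax, the value-0 states are exactly those from which the target is unreachable in the underlying graph, so the qualitative step is a backward search with no numerics involved. A sketch on a hypothetical MDP:

```python
# Qualitative preprocessing sketch: a state has Pmax(T) = 0 iff no
# sequence of actions can reach T, found by backward search from T.
delta = {
    ('s0', 'a'): {'s1': 0.5, 's2': 0.5},
    ('s1', 'a'): {'goal': 1.0},
    ('s2', 'a'): {'s2': 1.0},          # an absorbing sink: value 0
    ('goal', 'a'): {'goal': 1.0},
}
states = {'s0', 's1', 's2', 'goal'}
target = {'goal'}

reach = set(target)
frontier = list(target)
while frontier:
    t = frontier.pop()
    # add every state with some action that can move towards the target
    for (s, a), dist in delta.items():
        if s not in reach and t in dist:
            reach.add(s); frontier.append(s)

zero_states = states - reach
print(sorted(zero_states))  # -> ['s2']
```

The numerical algorithms then only run on `reach - target`, which both shrinks the problem and removes the singular (absorbing) part of the linear system.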

Topological methods, e.g. [17], do not consider the whole MDP at once. Instead, they first compute a topological ordering of the strongly connected components (SCCs)<sup>5</sup> and then analyze each SCC individually. This can improve the runtime, as we decompose the problem into smaller subproblems. The subproblems can be solved with any of the solution methods. Note that when considering acyclic MDPs, the topological approach does not need to call the solution methods, as the resulting values can immediately be backpropagated.

Collapsing of maximal end components (MECs), e.g., [13,34], transforms the MDP into one with equivalent values but simpler structure. After collapsing MECs, the MDP is contracting, i.e. we almost surely reach a target state or a state with value zero. VI algorithms rely on this property for convergence [34,40,56]. For PI and LP, simplifying the graph structure before applying the solution method can speed up the computation.

Warm starts, e.g. [26,46], may adequately initialize an algorithm, i.e., we may provide it with some prior knowledge so that the computation has a good starting point. We implement warm starts by first running VI for a limited number of iterations and using the resulting estimate to guess bounds on the variables in an LP or a good initial policy for PI. See Sections 3 and 4 for more details.

## 3 Practically solving MDPs using Linear Programs

This section considers the LP-based approach to solving the optimal policy problem in MDPs. To the best of our knowledge, this is the only polynomial-time approach. We discuss various configurations; each configuration is a combination of the LP formulation, the choice of software, and its parameterization.

## 3.1 How to encode MDPs as LPs?

For objective Pmax(T) we formulate the following LP over variables x_s, s ∈ S \ T:

$$\begin{aligned} \text{minimize} & \quad \sum\_{s \in \mathsf{S} \setminus \mathsf{T}} x\_s \quad \text{s.t.} \; lb(s) \le x\_s \le ub(s) \quad \text{and} \\ & x\_s \ge \sum\_{s' \in \mathsf{S} \setminus \mathsf{T}} \delta(s, a)(s') \cdot x\_{s'} + \sum\_{t \in \mathsf{T}} \delta(s, a)(t) \quad \text{for all } s \in \mathsf{S} \setminus \mathsf{T},\; a \in \mathsf{A}(s) \end{aligned}$$

<sup>5</sup> A set S′ ⊆ S is strongly connected if, for all s, s′ ∈ S′, s can be reached from s′. We call S′ a strongly connected component if it is inclusion-maximal.

We assume bounds lb(s) = 0 and ub(s) = 1 for s ∈ S \ T. The unique solution η: { x_s | s ∈ S \ T } → [0, 1] to this LP coincides with the desired objective values, η(x_s) = V(s). Objectives Pmin(T) and Eopt(rew) have similar encodings: minimizing policies require maximisation in the LP and flipping the constraint relation. Rewards can be added as an additive term on the right-hand side. For practical purposes, the LP formulation can be tweaked.

The choice of bounds. Any bounds that respect the unique solution will not change the answer. That is, any lb and ub with 0 ≤ lb(s) ≤ V(s) ≤ ub(s) yield a sound encoding. While these additional bounds are superfluous, they may significantly prune the search space. We investigate trivial bounds, e.g., knowing that all probabilities are in [0, 1], bounds from a structural analysis as discussed by [8], and bounds induced by a warm start of the solver. For the latter, if we have obtained values V′ ≤ V, e.g., induced by a suboptimal policy, then V′(s) is a lower bound on the value x_s, which is particularly relevant as the LP minimizes.

Equality for unique actions. Markov chains, i.e., MDPs where |A| = 1, can be solved using linear equation systems. The LP encoding uses one-sided inequalities and the objective function to incorporate nondeterministic choices. We investigate adding constraints for all states with a unique action.

$$x\_s \le \sum\_{s' \in S \backslash T} \delta(s, a)(s') \cdot x\_{s'} + \sum\_{t \in T} \delta(s, a)(t) \quad \text{for all } s \in S \backslash T \text{ with } \mathsf{A}(s) = \{a\}$$

These additional constraints may trigger different optimizations in a solver, e.g., some solvers use Gaussian elimination for variable elimination.

A simpler objective. The standard objective ensures that the solution η is optimal for every state, whereas most invocations require only optimality in some specific states – typically the initial state s_0 or the entry states of a strongly connected component. In that case, the objective may be simplified to optimize only the value for those states. This potentially allows for multiple optimal solutions: in terms of the MDP, it is no longer necessary to optimize the value for states that are not reached under the optimal policy.

Encoding the dual formulation. Encoding a dual formulation to the LP is interesting for mixed-integer extensions to the LP, relevant for computing, e.g., policies in POMDPs [47], or when computing minimal counterexamples [58]. For LPs, due to the strong duality, the internal representation in the solvers we investigated is (almost) equivalent and all solvers support both solving the primal and the dual representation. We therefore do not further consider constructing them.

#### 3.2 How to solve LPs with existing solvers?

We rely on the performance of state-of-the-art LP solvers. Many solvers have been developed and are still actively advanced, see [2] for a recent comparison on general benchmarks. We list the LP solvers that we consider for this work in Table 2. The columns summarize for each solver the type of license, whether it uses exact or floating-point arithmetic, whether it supports multithreading,


Table 2: Available LP solvers ("intr" = interior point)

and what type of algorithms it implements. We also list whether the solver is available from the two model checkers used in this study<sup>6</sup> .

Methods. We briefly explain the available methods and refer to [12] for a thorough treatment. Broadly speaking, the LP solvers use one of two families of methods. Simplex-based methods rely on highly efficient pivot operations to consider vertices of the simplex of feasible solutions. Simplex can be executed either in the primal or dual fashion, which changes the direction of progress made by the algorithm. Our LP formulation has more constraints than variables, which generally means that the dual version is preferable. Interior methods, often the subclass of barrier methods, do not need to follow the set of vertices. These methods may achieve polynomial worst-case behaviour. It is generally claimed that simplex has superior average-case performance but is highly sensitive to perturbations, while interior-point methods have a more robust performance.

Warm starts. LP-based model checking can be done using two types of warm starts. Either by providing a (feasible) basis point as done in [26] or by presenting bounds. The former, however, comes with various remarks and limitations, such as the requirement to disable preprocessing. We therefore used warm starts only by using bounds as discussed above.

Multithreading. We generally see two types of parallelisation in LP solvers. Some solvers support a portfolio approach that runs different approaches and finishes with the first one that yields a result. Other solvers parallelize the interior-point and/or simplex methods themselves.

Guarantees for numerical LP solvers. All LP solvers allow tweaking of various parameters, including tolerances that manage whether a point is considered feasible or optimal. The experiments in Table 1 already indicate that these guarantees are not absolute. A limited experiment indicated that reducing these tolerances towards zero removed some incorrect results, but not all.

<sup>6</sup> Support for Gurobi, GLPK, and Z3 was already available in Storm. Support for Glop was already available in mcsta. All other solver interfaces have been added.

Exact solving. SoPlex supports exact computations, with a Boost library wrapping GMP rationals [22], after a floating-point arithmetic-based startup phase [27]. While this combination is beneficial for performance in most settings, it leads to crashes for the numerically challenging models. Z3 supports only exact arithmetic (also wrapping GMP numbers with their own interface). We observe that the price of converting large rational numbers may be substantial. SMT solvers like Z3 use a simplex variation [18] tailored towards finding feasible points and in an incremental fashion, optimized for problems with a nontrivial Boolean structure. In contrast, our LP formulation is easily feasible and is a pure conjunction.

## 4 Sound Policy Iteration

Starting with an initial policy, PI-based algorithms iteratively improve the policy based on the values obtained for the induced MC. The algorithm used to solve the induced MC crucially affects the performance and accuracy of the overall approach. This section addresses the solvers available in Storm, possible precision issues, and how to utilize a warm start, while Section 5 discusses PI performance<sup>7</sup>.

Markov chain solvers. To solve the induced MC, Storm can employ all linear equation solvers listed in [42] and all implemented variants of VI. In our experiments, we consider (i) the generalized minimal residual method (GMRES) [57] implemented in GMM++ [25], (ii) VI [15] with a standard (relative) termination criterion, (iii) optimistic VI (OVI) [40], and (iv) the sparse LU decomposition implemented in Eigen [31] using either floating-point or exact arithmetic (LU<sup>X</sup>). LU and LU<sup>X</sup> provide exact results (modulo floating-point errors in LU) while OVI yields ε-precise results. VI and GMRES do not provide any guarantees.

Correctness of PI. The accuracy of PI is affected by the MC solver. Firstly, PI cannot be more precise than its underlying solver: the result of PI has the same precision as the result obtained for the final MC. Secondly, inaccuracies of the solver can hide policy improvements; this may lead to premature convergence to a sub-optimal policy. We show that PI can return arbitrarily wrong results, even if the intermediate results are ε-precise:

Consider the MDP in Fig. 2 with objective Pmax(♢{G}). There is only one nondeterministic choice, namely in state s0. The optimal policy is to pick b, obtaining a value of 0.5; picking a only yields 0.1. However, when starting from the initial policy π(s0) = a, an ε-precise MC solver may return 0.1 + ε for both s0 and s1 and δ/2 + (1 − δ) · 0.1 for s2. This solution is

Fig. 2: Example MDP

indeed ε-precise. However, when evaluating which action to pick in s0, we can choose δ such that a seems to obtain a higher value. Concretely, we require δ/2 + (1 − δ) · 0.1 < 0.1 + ε. For every ε > 0, this can be achieved by setting δ < 2.5 · ε. In this case, PI would terminate with the final policy inducing a severely suboptimal value.
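The threshold on δ can be checked numerically; a minimal sketch of the comparison PI would perform (function name and encoding are ours, following the example above):

```python
def a_looks_better(delta, eps):
    """Would an ε-precise evaluation make action a appear at least as good as b?"""
    est_a = 0.1 + eps                       # ε-precise estimate returned for s0/s1 under action a
    est_b = delta / 2 + (1 - delta) * 0.1   # estimate for s2, reached via action b
    return est_b < est_a                    # if True, PI (wrongly) keeps action a

eps = 1e-6
print(a_looks_better(2.4 * eps, eps))  # → True  (δ < 2.5·ε: a seems better)
print(a_looks_better(2.6 * eps, eps))  # → False (δ > 2.5·ε: improvement detected)
```

Since δ/2 + (1 − δ) · 0.1 = 0.1 + 0.4 · δ, the comparison reduces to 0.4 · δ < ε, i.e. δ < 2.5 · ε, matching the bound derived above.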

<sup>7</sup> [46] addresses performance in the context of PI for stochastic games.

If every Markov chain is solved precisely, PI is correct. Indeed, it suffices to be certain that one action is better than all others. This is the essence of modified policy iteration as described in [55, Chapters 6.5 and 7.2.6]. Similarly, [46, Section 4.2] suggests using interval iteration when solving the system induced by the current policy, stopping when the under-approximation of one action is higher than the over-approximation of all other actions.

Warm starts. PI profits from being provided a good initial policy. If the initial policy is already optimal, PI terminates after a single iteration. We can inform our choice of the initial policy by providing estimates for all states as computed by VI. For every state, we choose the action that is optimal according to the estimate. This is a good way to leverage VI's ability to quickly deliver good estimates [40], while at the same time providing the exactness guarantees of PI.
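The derivation of an initial policy from VI estimates can be sketched as follows (the nested-dict MDP encoding and function name are our own illustration, not Storm's API):

```python
def greedy_policy(mdp, estimate):
    """Per state, pick the action maximizing the estimated value of its successor distribution."""
    policy = {}
    for state, actions in mdp.items():
        policy[state] = max(
            actions,
            key=lambda a: sum(p * estimate[t] for t, p in actions[a].items()),
        )
    return policy

# Toy MDP: state -> action -> {successor: probability}
mdp = {"s0": {"a": {"s1": 1.0}, "b": {"s2": 1.0}},
       "s1": {"a": {"s1": 1.0}},
       "s2": {"a": {"s2": 1.0}}}
estimate = {"s0": 0.5, "s1": 0.1, "s2": 0.5}  # as delivered by a VI pre-pass
print(greedy_policy(mdp, estimate))  # → {'s0': 'b', 's1': 'a', 's2': 'a'}
```

PI then starts from this policy; if the VI estimates were already accurate, the very first policy is optimal and PI terminates after one exact solve.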

## 5 Experimental Evaluation

To understand the practical performance of the different algorithms, we performed an extensive experimental evaluation. We used three sets of benchmarks: all applicable benchmark instances<sup>8</sup> from the Quantitative Verification Benchmark Set (QVBS) [41] (the qvbs set), a subset of hard QVBS instances (the hard set), and numerically challenging models from a runtime monitoring application [45] (the premise set, named for the corresponding prototype). We consider two probabilistic model checkers, Storm [42] and the Modest Toolset's [37] mcsta. We used Intel Xeon Platinum 8160 systems running 64-bit CentOS Linux 7.9, allocating 4 CPU cores and 32 GB RAM to each experiment unless noted otherwise.

We plot algorithm runtimes in seconds in quantile plots as on the left and scatter plots as on the right of Fig. 3. The former compare multiple tools or configurations; for each, we sort the instances by runtime and plot the corresponding monotonically increasing line. Here, a point (x, y) on the line of tool a means that the x-th fastest instance solved by a took y seconds. The latter compare two tools or configurations. Each point (x, y) represents one benchmark instance: the x-axis tool took x seconds and the y-axis tool took y seconds to solve it. The shape of a point indicates the model type; the mapping from shapes to types is the same for all scatter plots and is only given explicitly in the first one in Fig. 3. Additional plots supporting the claims in this section are provided in the appendix of the full version [39] of this paper.

The depicted runtimes are for the respective algorithm and all necessary and/or stated preprocessing, but do not include the time for constructing the MDP state spaces (which is independent of the algorithms). mcsta reports all time measurements rounded to multiples of 0.1 s. We summarize timeouts, out-of-memory, errors, and incorrect results as "n/a". Our timeout is 30 minutes for the algorithm and 45 minutes for the total runtime including MDP construction. We consider a result v̄ incorrect if |v − v̄| > v · 10⁻³ (i.e. relative error 10⁻³) whenever a reference result v is available. We however do not flag a result as incorrect if

<sup>8</sup> A benchmark instance is a combination of model, parameter valuation, and objective.

Fig. 3: Comparison of LP solver runtime on the qvbs set

v and v̄ are both below 10⁻⁸ (relevant for the premise set). Nevertheless, we configure the (unsound) convergence threshold for VI as 10⁻⁶ relative; among the sound VI algorithms, we include OVI with a (sound) stopping criterion of 10⁻⁶ relative error. To achieve only the 10⁻³ precision we actually test, OVI could thus be even faster than it appears in our plots. We make this distinction to account for the fact that many algorithms, including the LP solvers, do not have a sound error criterion. We mark exact algorithms/solvers that use rational arithmetic with a superscript <sup>X</sup>. The other configurations use floating-point arithmetic (fp).
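The correctness check described above amounts to a small predicate; a sketch (the function name and argument order are ours):

```python
def flagged_incorrect(v_ref, v, rel_tol=1e-3, floor=1e-8):
    """Relative-error check |v_ref - v| > v_ref * rel_tol, with the small-value carve-out."""
    # Not flagged when both the reference and the result are below the floor
    # (the carve-out relevant for the premise set).
    if abs(v_ref) < floor and abs(v) < floor:
        return False
    return abs(v_ref - v) > abs(v_ref) * rel_tol

print(flagged_incorrect(0.5, 0.5004))  # → False (within 1e-3 relative error)
print(flagged_incorrect(0.5, 0.5006))  # → True  (exceeds it)
print(flagged_incorrect(1e-9, 5e-9))   # → False (both below 1e-8)
```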

#### 5.1 The QVBS Benchmarks

The qvbs set comprises all QVBS benchmark instances with an MDP, Markov automaton (MA), or probabilistic timed automaton (PTA) model<sup>9</sup> and a quantitative reachability or expected reward/time objective, i.e. not a query that yields probability zero or one. We only consider instances where both Storm and mcsta can build the explicit representation of the MDP within 15 minutes. This yields 367 instances. We obtained reference results for 344 of them, from either the QVBS database or one of Storm's exact methods. All reference results obtained via different methods were consistent.

For LP, we have various solvers with various parameters each, cf. Section 3. For conciseness, we first compare all available LP solvers on the qvbs set. For the best-performing solver, we then evaluate the benefit of different solver configurations. We do the same for the choice of Markov chain solution method in PI. We then focus on these single, reasonable setups for LP and PI in more detail.

LP solver comparison. The left-hand plot of Fig. 3 summarizes the results of our comparison of the different LP solvers. Subscripts <sup>s</sup> and <sup>m</sup> indicate whether the solver is embedded in Storm or mcsta, respectively. We apply no optimizations or

<sup>9</sup> MA and PTA are converted to MDP via embedding and digital clocks [48].

Fig. 4: Performance impact of LP problem formulation variants (using Gurobis)

reductions to the MDPs except for the precomputation of probability-0 states (and in Storm also of probability-1 states), and use the default settings for all solvers, with the trivial variable bounds [0, 1] and [0, ∞) for probabilities and expected rewards, respectively. We include VI as baseline. In Table 3, we summarize the results.

In terms of performance and scalability, Gurobi solves the highest number of benchmarks in any given time budget, closely followed by COPT. CPLEX, HiGHS, and Mosek make up a middle-class group. While the exact solver Z3 is very slow, SoPlex's exact mode actually competes with some fp solvers. However, the quantile plots do not tell the whole story. On the right of Fig. 3, we compare COPT and Gurobi directly: each has a large number of instances on which it is (much) better.

In terms of reliability of results, the exact solvers, as expected, produce no incorrect results; neither does the slowest fp solver, lp\_solve. COPT, CPLEX, HiGHS, Mosek, and fp-SoPlex perform badly on this metric, producing more errors than VI. Interestingly, these are mostly the faster solvers, the exception being Gurobi.

Overall, Gurobi achieves the highest performance at decent reliability; in the remainder of this section, we thus use Gurobi<sup>s</sup> whenever we apply non-exact LP.

LP solver tweaking. Gurobi can be configured to use an "auto" portfolio approach, potentially running multiple algorithms concurrently on multiple threads, a primal or dual simplex algorithm, or a barrier method. We compared each option with 4 threads and found no significant performance difference. Similarly, running the auto method with 1, 4, and 16 threads (only here, we allocate 16 threads per experiment) also failed to show noticeable performance differences. Using more threads results in a few more out-of-memory errors, though. We thus fix Gurobi on auto with 4 threads.

Fig. 4 shows the performance impact of supplying Gurobi with more precise bounds on the variables for expected reward objectives using methods from

Table 3: LP summary


Fig. 5: Comparison of MDP model checking algorithms on the qvbs set

[8,51] ("bounds" instead of "simple"), of optimizing only for the initial state ("init") instead of the sum over all states ("all"), and of using equality ("eq") instead of less-/greater-than-or-equal ("ineq") for unique-action states. More precise bounds yield a very small improvement at essentially no cost. Optimizing for the initial state only results in slightly better overall performance (visible in the "pocket" in the quantile plot around x = 315, also clearly visible in the scatter plot). However, it also results in 2 more incorrect results on the qvbs set. Using equality for unique actions noticeably decreases performance and increases the incorrect result count by 9 instances. For all experiments that follow, we thus use the more precise bounds but do not enable the other two optimizations.

PI methods comparison. The main choice in PI is which algorithm to use to solve the induced Markov chains. On the right, we show the performance of the different algorithms available in Storm (cf. Section 4). LU<sup>X</sup> yields a fully exact PI. Interestingly, this performs better than the fp version, potentially because fp errors induce spurious policy changes. The same effect likely also hinders OVI, whereas VI leads to good performance. Nevertheless, gmres is best overall and thus our choice for all following experiments with non-exact PI. VI and gmres yield 6 and 4 incorrect results, respectively. OVI and the exact methods are always correct on this benchmark set.

Best MDP algorithms for QVBS. We now compare all MDP model checking algorithms on the qvbs set: with floating-point numbers, LP and PI configured as described above, plus unsound VI, sound OVI, and the warm-start variants of PI and LP denoted "VI2PI" and "VI2LP", respectively. Exact results are provided by rational search (RS, essentially an exact version of VI) [50], PI with exact LU, and LP with exact solvers (SoPlex and Z3). All are implemented in Storm.

In a first experiment, we evaluated the impact of using the topological approach and of collapsing MECs (cf. Section 2.4). The results (we omit the plots) are that the topological approach noticeably improves performance and scalability for all algorithms, and we therefore always use it from now on. Collapsing MECs is necessary to guarantee termination of OVI, while for the

Fig. 6: Additional direct performance comparisons

Fig. 7: Comparison of MDP model checking algorithms on the hard subset

other algorithms it is a potential optimization; however, we found it to have only a minimal positive performance impact overall. Since it is required by OVI and does not reduce performance, we also always use it from now on.

Fig. 5 shows the complete comparison of all the methods on the qvbs set, with fp algorithms on the left and exact solutions on the right. Among the fp algorithms, OVI is clearly the fastest and most scalable. VI is often somewhat faster, but incurs several incorrect results that diminish its appearance in the quantile plot. OVI is additionally special among these algorithms in that it is sound, i.e. provides guaranteed ε-correct results (up to fp rounding errors, which can be eliminated following the approach of [36]). On the exact side, PI with an inexact-VI warm start works best. The scatter plot in Fig. 6(a) shows the performance impact of computing an exact instead of an approximate solution.

#### 5.2 The Hard QVBS Benchmarks

The QVBS contains many models built for tools that use VI as their default algorithm. The other algorithms may actually be important for solving key challenging instances where VI/OVI perform badly, a contribution that could be hidden in the sea of instances trivial for VI. We thus zoom in on a selection of QVBS instances that appear "hard" for VI: those where VI takes longer than the prior MDP state

Fig. 8: Comparison of MDP model checking algorithms on the premise set

space construction phase in both Storm and mcsta, and additionally both phases together take at least 1 s. These are 18 of the previously considered 367 instances.

In Fig. 7, we show the behaviour of all the algorithms on this hard subset. OVI again works better than VI due to the incorrect results that VI returns. We see that the performance and scalability gap between the algorithms has narrowed; although OVI still "wins", LP in particular is much closer than on the full qvbs set. We also investigated the LP outcomes with solvers other than Gurobi: even on this set, Gurobi and COPT remain the fastest and most scalable solvers. With mcsta, in the basic configuration, they solve 16 and 17 instances, with the slowest taking 835 s and 1334 s, respectively; with the topological optimization, the numbers become 17 and 15 instances with the slowest at 1373 s and 1590 s. We show the detailed comparison of OVI and LP in Fig. 6(c), noting that there are a few instances where LP is much faster, and repeat the comparison between the best fp and exact algorithms (Fig. 6(b)).

#### 5.3 The Runtime Monitoring Benchmarks

While the QVBS is intentionally diverse, our third set of benchmarks is intentionally focused: We study 200 MDPs from a runtime monitoring study [45]. The original problem is to compute the normalized risk of continuing to operate the system being monitored subject to stochastic noise, unobservable and uncontrollable nondeterminism, and partial state observations. This is a query for a conditional probability. It is answered via probabilistic model checking by unrolling an MDP model along an observed history trace of length n ∈ { 50, . . . , 1000 } following the approach of Baier et al. [7]. The MDPs contain many transitions back to the initial state, ultimately resulting in numerically challenging instances (containing structures similar to the one of M<sup>n</sup> in Section 2.3). We were able to compute a reference result for all instances.

Fig. 8 compares the different MDP model checking algorithms on this set. In line with the observations in [45], we see very different behaviour compared to the QVBS. Among the fp solutions on the left, LP with Gurobi terminates very quickly (under 1 s) and produces either a correct result (155 instances) or a completely incorrect one (mostly 0, on 45 instances). VI behaves similarly, but is slower. OVI, in contrast, delivers no incorrect results, but instead fails to terminate on all but 116 instances. In the exact setting, warm starts using VI inherit its relative

slowness and consequently do not pay off. Exact PI outperforms both exact LP solvers. In the case of exact SoPlex, out of the 112 instances it does not manage to solve, 98 are crashes likely related to a confirmed bug in its current version.

The premise set highlights that the best MDP model checking algorithm depends on the application. Here, in the fp case, LP appears best but produces unreliable (incorrect) results; the seemingly much worse OVI at least does not do so. Given the numeric challenge, an exact method should be chosen, and we show that these actually perform well here.

## 6 Conclusion

We thoroughly investigated the state of the art in MDP model checking, showing that there is no single best algorithm for this task. For benchmarks which are not numerically challenging, OVI is a sensible default, closely followed by PI and LP with a warm start—although using the latter two means losing soundness as confirmed by a number of incorrect results in our experiments. For numerically hard benchmarks, PI and LP as well as computing exact solutions are more attractive, and clearly preferable in combination. Overall, although LP has the superior (polynomial) theoretical complexity, in our practical evaluation, it almost always performs worse than the other (exponential) approaches. This is even though we use modern commercial solvers and tune both the LP encoding of the problem as well as the solvers' parameters. While we observed the behaviour of the different algorithms and have some intuition into what makes the premise set hard, an entire research question of its own is to identify and quantify the structural properties that make a model hard.

Our evaluation also raises the question of how prevalent MDPs that challenge VI are in practice. Aside from the premise benchmarks, we were unable to find further sets of MDPs that are hard for VI. Notably, several stochastic games (SGs) difficult for VI were found in [46]; the authors noted that using PI for the SGs was better than applying VI to the SGs. However, when we extracted the induced MDPs, we found them all easy for VI. Similarly, [3] used a random generation of SGs of at most 10,000 states, many of which were challenging for the SG algorithms. Yet the same random generation modified to produce MDPs delivered only MDPs easily solved in seconds, even with drastically increased numbers of states. In contrast, Alagöz et al. [1] report that their random generation returned models where LP beat PI. However, their setting is discounted, and their description of the random generation was too superficial for us to be able to replicate it. We note that, in several of our scatter plots, the MA instances from the QVBS (where we check the embedded MDP) appeared more challenging overall than the MDPs. We thus conclude this paper with a call for challenging MDP benchmarks—as separate benchmark sets of unique characteristics like premise, or for inclusion in the QVBS.

Data availability statement. The datasets generated and analysed in this study and code to regenerate them are available in the accompanying artifact [38]. For Storm, our code builds on version 1.7.0. We used mcsta version 3.1.213.

## References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/ 4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Correct Approximation of Stationary Distributions**

Tobias Meggendorfer()

Institute of Science and Technology Austria, 3400 Klosterneuburg, Austria tobias.meggendorfer@ista.ac.at

**Abstract.** A classical problem for Markov chains is determining their stationary (or steady-state) distribution. This problem has an equally classical solution based on eigenvectors and linear equation systems. However, this approach does not scale to large instances, and iterative solutions are desirable. It turns out that a naive approach, as used by current model checkers, may yield completely wrong results. We present a new approach, which utilizes recent advances in partial exploration and mean payoff computation to obtain a correct, converging approximation.

## **1 Introduction**

*Discrete-time Markov chains* (MCs) are an elegant and standard framework to describe stochastic processes, with a vast area of applications such as computer science [4], biology [28], epidemiology [13], and chemistry [12], to name a few. In a nutshell, an MC comprises a set of states and a transition function, assigning to each state a distribution over successors. The system evolves by repeatedly drawing a successor state from the transition distribution of the current state. This can, for example, model communication over a lossy channel, a queuing network, or populations of predator and prey which grow and interact randomly. For many applications, the *stationary distribution* of such a system is of particular interest. Intuitively, this distribution describes which states the system is in after an "infinite" number of steps. For example, in a chemical reaction network this distribution could describe the equilibrium states of the mixture.

Traditionally, the stationary distribution is obtained by computing the dominant eigenvector of particular matrices and solving a series of linear equation systems. This approach is appealing in theory, since it is polynomial in the size of the considered Markov chain. Moreover, since linear algebra is an intensely studied field, many optimizations for the computations at hand are known.

In practice, these approaches however often turn out to be insufficient. Real-world models may have millions of states, often ruling out exact solution approaches. As such, attention turns to iterative methods. In particular, the popular model checker PRISM [21] employs the *power method* (or *power iteration*) to approximate the stationary distribution. Like many other iterative methods on Markov chains, it has exponential worst-case behaviour but obtains good results quickly on many models. (Models where iterative methods indeed converge slowly are called *stiff*.) However, as we show in this work, the

© The Author(s) 2023 S. Sankaranarayanan and N. Sharygina (Eds.): TACAS 2023, LNCS 13993, pp. 489–507, 2023.

https://doi.org/10.1007/978-3-031-30823-9\_25

"absolute change" criterion used by PRISM to stop the iteration is incorrect. In particular, the produced results may be arbitrarily wrong already on a model with only four states. In [14,7], the authors discuss a similar issue for the problem of *reachability*, also rooted in an incorrect absolute-change stopping criterion, and provide a solution through converging lower and *upper* bounds. In our case, the situation is more complicated. The convergence of the power method is quite difficult to bound: a good (and potentially tight) a-priori bound is given by the ratio of the first and second eigenvalues, which however is as hard to determine as solving the problem itself. In the case of MCs, only a crude bound on this ratio can be obtained easily, which gives an exponential bound on the number of iterations required to achieve a given precision. More strikingly, in contrast to reachability, there is to our knowledge no general *adaptive* stopping criterion for power iteration, i.e. a way to check whether the current iterates are already close to the correct result. Thus, one would always need to iterate for as many steps as given by the a-priori bound to obtain guarantees on the result. In summary, exact solution approaches do not scale well, and the existing iterative approach may yield wrong results or requires an intractable number of steps.
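The failure mode of an absolute-change stopping criterion is easy to reproduce; a minimal sketch on a slowly mixing two-state chain (our own toy example, not the four-state model referenced above):

```python
def power_iteration(P, pi, abs_tol):
    """Iterate pi <- pi*P until the maximal absolute change drops below abs_tol."""
    n = len(pi)
    while True:
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(pi, nxt)) < abs_tol:
            return nxt
        pi = nxt

eps = 1e-7
# Each state switches with tiny probability eps; true stationary distribution is (0.5, 0.5).
P = [[1 - eps, eps], [eps, 1 - eps]]
result = power_iteration(P, [1.0, 0.0], abs_tol=1e-6)
print(result[0])  # still close to 1.0, far from the true value 0.5
```

Because each step changes the iterate by at most ≈ eps per entry, the criterion fires immediately even though the iterate is nowhere near the stationary distribution.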

Another, orthogonal issue of the mentioned approaches is that they construct the *complete* system, i.e. determine the stationary distribution for each state. However, if we figure out that, for example, the stationary distribution has a value of at least 99% for one state, all other states can have at most 1% in total. If we are satisfied with an *approximate* solution, we could already stop the computation here, without investigating any other state. Inspired by the results of [7,18], we thus also want to find such an approximate solution, capable of identifying the relevant parts of the system and only constructing those.

## **1.1 Contributions**

In this work, we address all the above issues. To this end, we


#### **1.2 Related Work**

Most related is the work of [30], which also tries to identify the most relevant parts of the system; however, it employs the special structure given by cellular processes to find these regions and estimate the subsequent approximation error. Many other works deal with special cases, such as queueing models [1,17], time-reversible chains [8], or positive rows (all states have a transition to one particular state) [9,11,27]. In contrast, our methods aim to deal with general Markov chains. We highlight that for the "positive row" case, [11] also provides converging bounds, however through a different route. Another topic of interest is continuous-time Markov chains, where abstraction- and truncation-based algorithms are applicable [20,3] and computation of the stationary distribution can be used for time-bounded reachability [16].

## **2 Preliminaries**

As usual, N and R refer to the (positive) natural numbers and real numbers, respectively. For a set *S*, S̄ denotes its complement, while *S*<sup>⋆</sup> and *S*<sup>ω</sup> refer to the sets of finite and infinite sequences of elements of *S*, respectively. We write 1<sub>*S*</sub>(*s*) = 1 if *s* ∈ *S* and 0 otherwise for the *characteristic function* of *S*.

We assume familiarity with basic notions of probability theory, e.g., *probability spaces*, *probability measures*, and *measurability*; see e.g. [6] for a general introduction. A *probability distribution* over a countable set *X* is a mapping *d* : *X* → [0, 1] such that ∑<sub>*x*∈*X*</sub> *d*(*x*) = 1. Its *support* is denoted by supp(*d*) = {*x* ∈ *X* | *d*(*x*) > 0}. D(*X*) denotes the set of all probability distributions on *X*. Some event happens *almost surely* (a.s.) if it happens with probability 1.

The central objects of interest are Markov chains, a classical model for systems with stochastic behaviour: A (discrete-time, time-homogeneous) *Markov chain (MC)* is a tuple M = (*S*, *δ*), where *S* is a finite set of *states*, and *δ* : *S* → D(*S*) is a *transition function* that for each state *s* yields a probability distribution over successor states. We deliberately exclude the explicit definition of an initial state. We direct the interested reader to, e.g., [4, Sec. 10.1], [29, App. A], or [19] for further information on Markov chains and related notions.

For ease of notation, we write *δ*(*s*, *s*′) instead of *δ*(*s*)(*s*′), and, given a function *f* : *S* → R mapping states to real numbers, we write *δ*(*s*)⟨*f*⟩ := ∑<sub>*s*′∈*S*</sub> *δ*(*s*, *s*′) · *f*(*s*′) to denote the weighted sum of *f* over the successors of *s*.

We always assume an arbitrary but fixed numbering of the states and identify a state with its respective number. For example, given a vector *v* ∈ R<sup>|*S*|</sup> and a state *s* ∈ *S*, we may write *v*[*s*] to denote the value associated with *s* by *v*. In this way, a function *v* : *S* → R is equivalent to a vector *v* ∈ R<sup>|*S*|</sup>.

For a set of states *R* ⊆ *S* that no transitions leave, i.e. *δ*(*s*, *s*′) = 0 for all *s* ∈ *R*, *s*′ ∈ *S* \ *R*, we define the *restricted Markov chain* M|<sub>*R*</sub> := (*R*, *δ*|<sub>*R*</sub>) with *δ*|<sub>*R*</sub> : *R* → D(*R*) copying the values of *δ*, i.e. *δ*|<sub>*R*</sub>(*s*, *s*′) = *δ*(*s*, *s*′) for all *s*, *s*′ ∈ *R*.

*Paths* An *infinite path* *ρ* in a Markov chain is an infinite sequence *ρ* = *s*<sub>1</sub>*s*<sub>2</sub> · · · ∈ *S*<sup>ω</sup> such that for every *i* ∈ N we have *δ*(*s*<sub>*i*</sub>, *s*<sub>*i*+1</sub>) > 0. We use *ρ*(*i*) to refer to the *i*-th state *s*<sub>*i*</sub> in a given infinite path. We denote the set of all infinite paths of a Markov chain M by Paths<sub>M</sub>. Observe that in general Paths<sub>M</sub> is a proper subset of *S*<sup>ω</sup>, as we imposed additional constraints. A Markov chain together with an initial state ŝ ∈ *S* induces a unique probability measure Pr<sub>M,ŝ</sub> over infinite paths [4, Sec. 10.1]. Given a measurable random variable *f* : Paths<sub>M</sub> → R, we write E<sub>M,ŝ</sub>[*f*] := ∫<sub>*ρ*∈Paths</sub> *f*(*ρ*) *d*Pr<sub>M,ŝ</sub> to denote its expectation w.r.t. this measure.

*Reachability* An important tool in the following is the notion of *reachability probability*, i.e. the probability that the system, starting from a state ŝ, will eventually reach a given set *T*. Formally, for a Markov chain M and a set of states *T*, we define the set of runs which reach *T* (i) at step *n* by ♢<sup>=*n*</sup>*T* := {*ρ* ∈ Paths<sub>M</sub> | *ρ*(*n*) ∈ *T*} and (ii) eventually by ♢*T* := ⋃<sub>*i*=1</sub><sup>∞</sup> ♢<sup>=*i*</sup>*T*. (For a measurability proof see e.g. [4, Chp. 10].) For a state ŝ, the probability to reach *T* is given by Pr<sub>M,ŝ</sub>[♢*T*].

Classically, the reachability probability can be determined by solving a linear equation system, as follows. For a fixed target set *T*, let *S*<sub>0</sub> be the set of all states that cannot reach *T*. Note that *S*<sub>0</sub> can be determined by simple graph analysis. Then, the reachability probability Pr<sub>M,ŝ</sub>[♢*T*] equals *f*(ŝ), where *f* is the unique solution of [4, Thm. 10.19]

$$f(s) = 1 \text{ if } s \in T, \quad 0 \text{ if } s \in S_0, \quad \text{and} \quad \delta(s)\langle f \rangle \text{ otherwise.} \tag{1}$$

*Value Iteration* A classical tool for dealing with Markov chains is *value iteration* (VI) [5]. It is a simple yet surprisingly efficient and extendable approach to solving a variety of problems. At its heart, VI relies, as the name suggests, on iteratively applying an operation to a value vector. This operation is often called a "Bellman backup" or "Bellman update", usually derived from a fixed-point characterization of the problem at hand; thus, VI can often be viewed as fixed-point iteration. For reachability, inspired by Eq. (1), we start from *v*<sub>1</sub>[*s*] = 0 and iterate

$$v_{k+1}[s] = 1 \text{ if } s \in T, \quad 0 \text{ if } s \in S_0, \quad \text{and} \quad \delta(s)\langle v_k \rangle \text{ otherwise.} \tag{2}$$

This iteration monotonically converges from below to the true value in the limit [4, Thm. 10.15], [29, Thm. 7.2.12]. Convergence up to a given precision may take exponential time [14, Thm. 3], but in practice VI is often much faster than methods based on equation solving. For further details, see [26, App. A.2].
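Iteration (2) can be sketched directly; a minimal pure-Python version (the dict-based MC encoding is our own):

```python
def reach_vi(delta, T, S0, iters):
    """Approximate reachability probabilities by iterating Eq. (2) from v = 0."""
    v = {s: 0.0 for s in delta}
    for _ in range(iters):
        v = {s: 1.0 if s in T else
                0.0 if s in S0 else
                sum(p * v[t] for t, p in delta[s].items())
             for s in delta}
    return v

# Chain: s -> t w.p. 1/2 and s -> z w.p. 1/2; target {t}, states that cannot reach T: {z}
delta = {"s": {"t": 0.5, "z": 0.5}, "t": {"t": 1.0}, "z": {"z": 1.0}}
v = reach_vi(delta, T={"t"}, S0={"z"}, iters=50)
print(v["s"])  # → 0.5
```

On this acyclic example the iteration stabilizes after two steps; on cyclic chains the values only converge in the limit, which is exactly why a sound stopping criterion matters.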

*Strongly Connected Components* A non-empty set of states *C* ⊆ *S* in a Markov chain is *strongly connected* if for every pair *s*, *s*′ ∈ *C* there is a non-empty finite path from *s* to *s*′. Such a set *C* is a *strongly connected component* (SCC) if it is inclusion-maximal, i.e. there exists no strongly connected *C*′ with *C* ⊊ *C*′. SCCs are disjoint: each state belongs to at most one SCC. An SCC is a *bottom* SCC (BSCC) if additionally no path leads out of it, i.e. for all *s* ∈ *C*, *s*′ ∈ *S* \ *C* we have *δ*(*s*, *s*′) = 0. The set of BSCCs of an MC M is denoted by BSCC(M) and can be determined in linear time by, e.g., Tarjan's algorithm [32].
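BSCC computation can be sketched with Tarjan's algorithm plus a bottomness check; a compact recursive version (sufficient for small examples, an iterative variant would be needed for deep graphs):

```python
def bsccs(delta):
    """Bottom SCCs of an MC given as {state: {successor: probability}} (Tarjan's algorithm)."""
    index, low, stack, on_stack, sccs = {}, {}, [], set(), []

    def strongconnect(s):
        index[s] = low[s] = len(index)
        stack.append(s); on_stack.add(s)
        for t in delta[s]:
            if t not in index:
                strongconnect(t)
                low[s] = min(low[s], low[t])
            elif t in on_stack:
                low[s] = min(low[s], index[t])
        if low[s] == index[s]:  # s is the root of an SCC
            scc = set()
            while True:
                t = stack.pop(); on_stack.discard(t); scc.add(t)
                if t == s:
                    break
            sccs.append(scc)

    for s in delta:
        if s not in index:
            strongconnect(s)
    # An SCC is bottom iff no transition leaves it
    return [c for c in sccs if all(t in c for s in c for t in delta[s])]

delta = {"a": {"b": 1.0}, "b": {"b": 0.5, "c": 0.5}, "c": {"c": 1.0}}
print(bsccs(delta))  # → [{'c'}]
```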

The bottom components fully capture the limit behaviour of any Markov chain. Intuitively, the following statement says that (i) with probability one a run of a Markov chain eventually remains forever inside one single BSCC, and (ii) inside a BSCC, all states are visited infinitely often with probability one.

**Lemma 1 ([4, Thm. 10.27]).** *For any MC* M *and state s, we have*

$$\Pr_{\mathsf{M},s}[\{\rho \mid \exists R_i \in \mathrm{BSCC}(\mathsf{M}).\ \exists n_0 \in \mathbb{N}.\ \forall n > n_0.\ \rho(n) \in R_i\}] = 1.$$

*For any BSCC* $R \in \mathrm{BSCC}(\mathsf{M})$ *and states* $s, s' \in R$*, we have* $\Pr_{\mathsf{M},s}[\Diamond\{s'\}] = 1$*.*

*Stationary Distribution* Given a state *s*ˆ, the *stationary distribution* (also known as *steady-state* or *long-run distribution*) of a Markov chain intuitively describes, for each state *s*, the probability for the system to be at this particular state at an

**Fig. 1.** Example MC to demonstrate the stationary distribution. We have that $\pi^\infty_{\mathsf{M},s} = \{p \mapsto \frac{1}{2},\ s \mapsto 0,\ q_1 \mapsto \frac{1}{2}\cdot\frac{1}{6},\ q_2 \mapsto \frac{1}{2}\cdot\frac{5}{6}\}$.

arbitrarily chosen step "at infinity". There are several ways to define this notion. In particular, there is a subtle difference between the *limiting* and the *stationary* distribution, which however coincide for *aperiodic* MC. For the sake of readability, we omit this distinction and assume w.l.o.g. that all MCs we deal with are aperiodic. See [26, App. A.1] for further discussion. Our definition follows the view of [4, Def. 10.79]; see [29, Sec. A.4] for a different approach.

**Definition 1.** *Fix a Markov chain* $\mathsf{M} = (S, \delta)$ *and initial state* $\hat{s}$*. Let* $\pi^n_{\mathsf{M},\hat{s}}(s) := \Pr_{\mathsf{M},\hat{s}}[\Diamond^{=n}\{s\}]$ *be the probability that the system is at state* $s$ *in step* $n$*. Then,* $\pi^\infty_{\mathsf{M},\hat{s}}(s) := \lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} \pi^i_{\mathsf{M},\hat{s}}(s)$ *is the stationary distribution of* $\mathsf{M}$*.*

See Fig. 1 for an example. Whenever the reference is clear from context, we omit the respective subscripts from $\pi^\infty_{\mathsf{M},\hat{s}}$.

We briefly recall the classical approach to compute stationary distributions (see e.g. [19, Sec. 4.7]). By Lemma 1, almost all runs eventually end up in a BSCC. Thus, $\pi^\infty(s) = 0$ for all states $s$ not in a BSCC, or, dually, $\sum_{s\in B} \pi^\infty(s) = 1$ for $B = \bigcup_{R\in\mathrm{BSCC}(\mathsf{M})} R$. Moreover, once in a BSCC, we always obtain the same stationary distribution, irrespective of the state through which we entered the BSCC. Formally, for each BSCC $R \in \mathrm{BSCC}(\mathsf{M})$ and $s, s' \in R$, we have that $\pi^\infty_{\mathsf{M},s} = \pi^\infty_{\mathsf{M},s'} = \pi^\infty_{\mathsf{M}|_R,s}$, i.e. each BSCC $R$ has a unique stationary distribution, which we denote by $\pi^\infty_R$. Note that $\mathrm{supp}(\pi^\infty_R) = R$, i.e. $\pi^\infty_R(s) \neq 0$ if and only if $s \in R$. Together, we observe that the stationary distribution of a Markov chain decomposes into (i) the steady-state distribution in each BSCC and (ii) the probability to end up in a particular BSCC. More formally, for any state $s \in S$

$$\pi^\infty_{\mathsf{M},\hat{s}}(s) = \sum_{R\in\mathrm{BSCC}(\mathsf{M})} \Pr_{\mathsf{M},\hat{s}}[\Diamond R] \cdot \pi^\infty_R(s). \tag{3}$$

Consider the example of Fig. 1: We have two BSCCs, $\{p\}$ and $\{q_1, q_2\}$, each of which is reached with probability $\frac{1}{2}$. The overall distribution $\pi^\infty_{\mathsf{M},s}$ then is obtained from $\pi^\infty_{\{p\}} = \{p \mapsto 1\}$ and $\pi^\infty_{\{q_1,q_2\}} = \{q_1 \mapsto \frac{1}{6},\ q_2 \mapsto \frac{5}{6}\}$.

As mentioned, we can compute reachability probabilities in Markov chains by solving Eq. (1). Thus, the remaining concern is to compute $\pi^\infty_R$, i.e. the stationary distribution of $\mathsf{M}|_R$. In this case, i.e. for Markov chains comprising a single BSCC, the steady-state distribution is the unique fixed point of the transition function (up to rescaling). By defining the (row-stochastic) transition matrix of $\mathsf{M}$ as $P_{i,j} = \delta(i, j)$, we can reformulate this property in terms of linear algebra. In particular, viewing $\pi^\infty_R$ as a row vector, we have that $\pi^\infty_R \cdot P = \pi^\infty_R$, or, in other words, $\pi^\infty_R \cdot (P - I) = \vec{0}$, where $I$ is an appropriately sized identity matrix [29, Thm. A.2]. This equation can again be solved by classical methods from linear algebra. In summary, we (i) compute $\mathrm{BSCC}(\mathsf{M})$, (ii) for each BSCC $R$, compute $\pi^\infty_R$ and $\Pr_{\mathsf{M},\hat{s}}[\Diamond R]$, and (iii) combine according to Eq. (3).
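As an illustration of the equation-solving step, the sketch below computes the stationary distribution of a single toy BSCC by replacing one balance equation with the normalisation constraint and running plain Gaussian elimination. The matrix encoding and example numbers are our own; real tools use sparse linear algebra libraries rather than this dense toy solver.

```python
# Sketch of the classical scheme for one BSCC: solve pi (P - I) = 0 together
# with the normalisation sum(pi) = 1, via Gaussian elimination (toy-sized).
# The 2-state BSCC below is chosen so that pi = (1/6, 5/6), as in Fig. 1.

def bscc_stationary(P):
    n = len(P)
    # Column j of pi (P - I) = 0 gives one equation; replace the last
    # (redundant) equation by the normalisation constraint sum(pi) = 1.
    A = [[P[i][j] - (1.0 if i == j else 0.0) for i in range(n)]
         for j in range(n)]
    A[-1] = [1.0] * n
    b = [0.0] * (n - 1) + [1.0]
    for col in range(n):                     # elimination with partial pivoting
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            A[r] = [x - f * y for x, y in zip(A[r], A[col])]
            b[r] -= f * b[col]
    pi = [0.0] * n
    for r in range(n - 1, -1, -1):           # back substitution
        pi[r] = (b[r] - sum(A[r][c] * pi[c] for c in range(r + 1, n))) / A[r][r]
    return pi

P = [[0.0, 1.0], [0.2, 0.8]]                 # q1 -> q2; q2 -> q1 w.p. 0.2
print(bscc_stationary(P))                    # ~ [0.1667, 0.8333]
```

The balance equation for the first state reads $-\pi_1 + 0.2\,\pi_2 = 0$, which together with $\pi_1 + \pi_2 = 1$ yields exactly $(\frac{1}{6}, \frac{5}{6})$.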

However, as also mentioned in the introduction, precisely solving linear equation systems may not scale well, due to both time and memory constraints. Thus, we are also interested in relaxing the problem slightly and instead *approximating* the stationary distribution up to a given precision $\varepsilon > 0$.

**Problem Statement** Given a Markov chain $\mathsf{M}$ and precision requirement $\varepsilon > 0$, compute bounds $l, u : S \to [0, 1]$ such that (i) $\max_{s\in S} u(s) - l(s) \le \varepsilon$ and (ii) for all $s \in S$ we have $l(s) \le \pi^\infty_{\mathsf{M},\hat{s}}(s) \le u(s)$.

*Approximate Solutions* Aiming for approximations is not a new idea; to achieve practical performance, current model checkers employ approximate, iterative methods by default for most queries (typically a variant of value iteration). In particular, this is also the case for the stationary distribution: Instead of solving the equation system for each BSCC $R$ precisely, we can approximate the solution by, e.g., the *power method*. This essentially means repeatedly applying the transition matrix (of the model restricted to the BSCC) to an initial vector $v_1$, i.e. iterating $v_{n+1} = P_R \cdot v_n$ (or, equivalently, $v_{n+1} = P_R^n \cdot v_1$). Similarly, the reachability probability of each BSCC is then also approximated by value iteration.
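A minimal sketch of the power method on a toy aperiodic BSCC follows; the encoding, tolerance, and stopping test are our own illustration (and, as discussed next, this "small change between iterates" test is exactly the kind of criterion that is not sound in general).

```python
# Sketch of the power method on a single (aperiodic) BSCC: iterate the
# distribution vector v_{n+1} = v_n P (row vector times row-stochastic P),
# stopping when consecutive iterates barely change.

def power_method(P, tol=1e-12, max_iters=100_000):
    n = len(P)
    v = [1.0 / n] * n                      # uniform initial distribution
    for _ in range(max_iters):
        nv = [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nv, v)) < tol:
            return nv
        v = nv
    return v

# Toy BSCC with stationary distribution (1/6, 5/6); the second eigenvalue of
# P is -0.2, so the iteration happens to converge quickly here.
P = [[0.0, 1.0], [0.2, 0.8]]
print(power_method(P))                     # ~ [0.1667, 0.8333]
```

On well-conditioned chains like this one the method is fast; its worst-case behaviour is a different matter, as the next paragraph explains.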

It is known that (for aperiodic MC) $\lim_{n\to\infty} v_n = \pi^\infty_R$ (see e.g. [31,16,27]); however, convergence up to a precision of $\varepsilon$ may take exponential time in the worst case. Moreover, there is no known stopping criterion which allows us to detect that we have converged and stop the computation early. Yet, similar to reachability [7,14], current model checkers employ this method without a sound stopping criterion, leading to potentially arbitrarily wrong results, as we show in our evaluation (Fig. 2). See [16] for a related, in-depth discussion of these issues in the context of CTMC.
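The danger of the naive stopping test can be seen on a hypothetical slow-mixing chain; this is our own two-state illustration, not the exact model of Fig. 2.

```python
# Hypothetical illustration: with e tiny, the difference between consecutive
# power-method iterates is O(e) from the very first step, so an
# absolute-difference stopping test fires immediately, even though the
# iterate is still far from the true stationary distribution (2/3, 1/3).

e = 1e-8
P = [[1 - e, e], [2 * e, 1 - 2 * e]]       # slow-mixing two-state chain
v = [0.0, 1.0]                             # start in the second state
nv = [sum(v[i] * P[i][j] for i in range(2)) for j in range(2)]
step = max(abs(a - b) for a, b in zip(nv, v))
print(step)   # ~2e-08: well below a naive tolerance of, say, 1e-6
print(nv)     # ~[0.0, 1.0]: but the true distribution is (2/3, 1/3)
```

The true stationary distribution follows from the balance condition $\pi_1 \cdot e = \pi_2 \cdot 2e$, i.e. $\pi_1 = 2\pi_2$; the iterate is nowhere near it when the naive test triggers.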

We thus want to find efficient methods to derive safe bounds on the stationary distribution of a BSCC with a correct stopping criterion and combine them with correct reachability approximations to obtain an overall fast and sound approximation. To this end, we exploit two further concepts.

*Partial Exploration* Recent works [7,2,18,24] demonstrate the applicability of *partial exploration* to a variety of problems associated with probabilistic systems such as reachability. Essentially, the idea is to "omit" parts of the system which can be proven to be irrelevant for the result, instead focussing on important areas of the system. Of course, by omitting parts of the system, we may incur a small error. As such, these approaches naturally aim for approximate solutions.

*Mean Payoff* We make use of another property, namely *mean payoff* (also known as *long-run average reward*). We provide a brief overview and refer to e.g. [29, Chp. 8 & 9] or [2] for more information. Mean payoff is specified by a Markov chain and a *reward function* $r : S \to \mathbb{R}$, assigning a reward to each state. Given an infinite path $\rho = s_1 s_2 \cdots$, this naturally induces a stream of rewards $r(\rho) := r(s_1) r(s_2) \cdots$. The mean payoff of this path then equals the average reward obtained in the limit, $\mathrm{mp}'_r(\rho) := \liminf_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n} r(s_i)$. (The limit might not be defined for some paths, hence considering the lim inf is necessary.) Finally, the mean payoff of a state $s$ is the *expected mean payoff* according to $\Pr_{\mathsf{M},s}$, i.e. $\mathrm{mp}_r(s) := \mathbb{E}_{\mathsf{M},s}[\mathrm{mp}'_r]$.

Classically, mean payoff is computed by solving a linear equation system [29, Thm. 9.1.2]. Instead, we can also employ value iteration to approximate the mean payoff, however with a slight twist. We iteratively compute the *expected total reward*, i.e. the expected sum of rewards obtained after $n$ steps, by iterating $v_{n+1}(s) = r(s) + \delta(s)\langle v_n\rangle$. It turns out that the *increase* $\Delta_n(s) = v_{n+1}(s) - v_n(s)$ approximates the mean payoff, i.e. $\mathrm{mp}_r(s) = \lim_{n\to\infty} \Delta_n(s)$ [29, Thm. 9.4.5 a)]. Moreover, we have $\min_{s'\in S} \Delta_n(s') \le \mathrm{mp}_r(s) \le \max_{s'\in S} \Delta_n(s')$, yielding a correct stopping criterion [29, Thm. 9.4.5 b)]. Finally, on BSCCs these upper and lower bounds always converge [29, Cor. 9.4.6 b)], yielding termination guarantees. We provide further details on VI for mean payoff in [26, App. A.3].
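The twist above can be sketched as follows; the dictionary encoding, reward choice, and `eps` are our own illustration on a toy BSCC.

```python
# Sketch of VI for mean payoff: iterate the expected total reward and use the
# span of the increase Delta_n as a sound stopping criterion, since
# min Delta_n <= mp_r <= max Delta_n holds at every step.

def mean_payoff_vi(delta, r, eps=1e-6):
    v = {s: 0.0 for s in delta}
    while True:
        nv = {s: r[s] + sum(p * v[t] for t, p in delta[s].items())
              for s in delta}
        inc = {s: nv[s] - v[s] for s in delta}
        lo, hi = min(inc.values()), max(inc.values())
        v = nv
        if hi - lo < eps:                  # sound bounds: lo <= mp_r <= hi
            return lo, hi

# Toy BSCC with stationary distribution (1/6, 5/6); reward 1 only in q1, so
# the mean payoff equals the long-run frequency of q1, i.e. 1/6.
delta = {"q1": {"q2": 1.0}, "q2": {"q1": 0.2, "q2": 0.8}}
lo, hi = mean_payoff_vi(delta, {"q1": 1.0, "q2": 0.0})
print(lo, hi)        # both ~ 0.1667
```

This already foreshadows Section 3.1: with an indicator reward, the mean payoff is precisely one entry of the stationary distribution.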

## **3 Building Blocks**

To arrive at a practical algorithm approximating the stationary distribution, we propose to employ sampling-based techniques, inspired by, e.g., [7,2,18]. Intuitively, these approaches repeatedly sample paths and compute bounds on a single property such as reachability or mean payoff. The sampling is designed to follow probable paths with high probability, hence the computation automatically focuses on the most relevant parts of the system. Additionally, by building the system *on the fly*, construction of hardly reachable parts of the system may be avoided altogether, yielding immense speed-ups for some models (see, e.g., [18] for additional background). We apply a series of tweaks to the original idea to tailor this approach to our use case, i.e. approximating the stationary distribution.

In this section, we present the "building blocks" for our approximate approach. In the spirit of Eq. (3), we discuss how we handle a single BSCC and how to approximate the reachability probabilities of all BSCCs. In the following section, we then combine these two approaches in a non-trivial manner.

#### **3.1 Bounds in BSCCs through Mean Payoff**

It is well known that the mean payof can be computed directly from the stationary distribution [29, Prop. 8.1.1], namely:

$$\mathrm{mp}_r(s) = \sum_{s' \in S} \pi^\infty_{\mathsf{M},s}(s') \cdot r(s') \tag{4}$$

In this section, we propose the opposite, namely computing the stationary distribution of a BSCC through mean payoff queries. Fix a Markov chain $\mathsf{M} = (S, \delta)$ which comprises a single BSCC, i.e. $S \in \mathrm{BSCC}(\mathsf{M})$, and define $r(s') = 1_{\{s\}}(s')$, i.e. 1 for $s$ and 0 otherwise. Then, the mean payoff corresponds to the frequency with which $s$ appears, i.e. the stationary distribution. Formally, we have that $\pi^\infty_{\mathsf{M},\hat{s}}(s) = \mathrm{mp}_r(s')$ for any state $s'$ (in a BSCC, all states have the same value). This also follows directly by inserting into Eq. (4). So, naively, for each state of the BSCC, we can solve a mean payoff query, and from these results obtain the overall stationary distribution.

#### **Algorithm 1** Approximate Stationary Distribution in BSCC

**Input:** Markov chain $\mathsf{M} = (S, \delta)$ with $\mathrm{BSCC}(\mathsf{M}) = \{S\}$
**Output:** Bounds $l, u$ on the stationary distribution $\pi^\infty_S$.
1: $n \leftarrow 1$
2: **for** $s \in S$ **do** $l_1(s) \leftarrow 0$, $u_1(s) \leftarrow 1$
3: **for** $s \in S$ **do**
4: &nbsp;&nbsp; $m \leftarrow 1$, $v_1 \leftarrow \mathrm{InitGuess}(s)$
5: &nbsp;&nbsp; **while** not ShouldStop($s, m, \Delta_m$) **do** ▷ *Iterate until some stopping criterion*
6: &nbsp;&nbsp;&nbsp;&nbsp; **for** $s' \in S$ **do** $v_{m+1}(s') \leftarrow 1_{\{s\}}(s') + \delta(s')\langle v_m\rangle$ ▷ *Mean payoff VI for* $s$
7: &nbsp;&nbsp;&nbsp;&nbsp; $m \leftarrow m + 1$
8: &nbsp;&nbsp; $l'_n(s) \leftarrow \max\big(l_n(s), \min_{s'\in S} \Delta_m(s')\big)$, $u'_n(s) \leftarrow \min\big(u_n(s), \max_{s'\in S} \Delta_m(s')\big)$
9: &nbsp;&nbsp; **for** $s' \in S \setminus \{s\}$ **do** $l'_n(s') \leftarrow l_n(s')$, $u'_n(s') \leftarrow u_n(s')$
10: &nbsp;&nbsp; **for** $s' \in S$ **do** ▷ *Update bounds based on current results (optional)*
11: &nbsp;&nbsp;&nbsp;&nbsp; $l_{n+1}(s') \leftarrow \max\big(l'_n(s'), 1 - \sum_{s''\in S, s''\neq s'} u'_n(s'')\big)$
12: &nbsp;&nbsp;&nbsp;&nbsp; $u_{n+1}(s') \leftarrow \min\big(u'_n(s'), 1 - \sum_{s''\in S, s''\neq s'} l'_n(s'')\big)$
13: &nbsp;&nbsp; $n \leftarrow n + 1$ and copy all unchanged values from $n$ to $n + 1$
14: **return** $(l_n, u_n)$

At first, this may seem excessive, especially considering that computing the complete stationary distribution is as hard as determining the mean payoff for one state (both can be obtained by solving a linearly sized equation system). However, this idea yields some interesting benefits. Firstly, using the approximation approach discussed in Section 2, we obtain a practical approximation scheme with converging bounds for each state. As such, we can quickly stop the computation if the bounds converge fast. Moreover, we can pause and restart the computation for each state, which we use later on in order to focus on crucial states. Finally, observe that $\pi^\infty_R$ is a distribution. Thus, having lower bounds on some states already yields upper bounds for the remaining states. Formally, for some lower bound $l : S \to [0, 1]$, we have $\pi^\infty_R(s) \le 1 - \sum_{s'\in S, s'\neq s} l(s')$. If during our computation it turns out that a few states are actually visited very frequently, i.e. the sum of their lower bounds is close to 1, we can already stop the computation without ever investigating the other states. Note that this is only possible because we obtain provably correct bounds.

Combining these ideas, we present our first algorithm template in Algorithm 1. We solve each state separately, applying the classical value iteration approach for mean payoff until a termination criterion is satisfied. To allow for modifications, we leave the definition of several sub-procedures open. Firstly, InitGuess initializes the value vector for each mean payoff computation. We can naively choose 0 everywhere, obtain an initial guess by heuristics, or re-use previously computed values. Secondly, ShouldStop decides when to stop the iteration for each state. A simple choice is to iterate until $\max \Delta_m(s) - \min \Delta_m(s) < \varepsilon$ for some precision requirement $\varepsilon$. By results on mean payoff, we can conclude that in this case the stationary distribution is computed with a precision of $\varepsilon$. However, as we argue later on, more sophisticated choices are possible. Finally, the order in which states are chosen is not fixed. Indeed, any order yields correct results; however, heuristically re-ordering the states may bring practical benefits.
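A minimal Python rendering of Algorithm 1, instantiating InitGuess with the zero vector and ShouldStop with the span criterion just described, may look as follows; the data encoding and example are our own, and a real implementation would interleave and pause the per-state computations.

```python
# Sketch of Algorithm 1: per state s, run mean payoff VI with the indicator
# reward 1_{s}, then tighten all bounds using that pi is a distribution.

def bscc_bounds(delta, eps=1e-6):
    S = list(delta)
    l = {s: 0.0 for s in S}
    u = {s: 1.0 for s in S}
    for s in S:
        v = {t: 0.0 for t in S}            # InitGuess: zero vector
        while True:
            nv = {t: (1.0 if t == s else 0.0)
                     + sum(p * v[t2] for t2, p in delta[t].items())
                  for t in S}
            inc = [nv[t] - v[t] for t in S]
            v = nv
            if max(inc) - min(inc) < eps:  # ShouldStop: span criterion
                break
        l[s] = max(l[s], min(inc))
        u[s] = min(u[s], max(inc))
        for t in S:                        # optional tightening step
            l[t] = max(l[t], 1.0 - sum(u[t2] for t2 in S if t2 != t))
            u[t] = min(u[t], 1.0 - sum(l[t2] for t2 in S if t2 != t))
    return l, u

delta = {"q1": {"q2": 1.0}, "q2": {"q1": 0.2, "q2": 0.8}}  # pi = (1/6, 5/6)
l, u = bscc_bounds(delta)
print(l["q1"], u["q1"])    # both ~ 0.1667
```

On this example the tightening step already pins down the bound for `q2` after the query for `q1` finishes, illustrating the "stop early once lower bounds sum to almost 1" effect.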

Before we continue, we briefly argue that the algorithm is correct.

**Theorem 1.** *The result returned by Algorithm 1 is correct for any MC* M = (*S, δ*) *with* BSCC(M) = {*S*}*.*

*Proof (Sketch).* Correctness of the mean payoff iteration follows from the definition of the reward function, Eq. (4), and the correctness of value iteration for mean payoff [29, Sec. 8.5]. In particular, note that the states of the MC form a single BSCC and the model is *unichain* (see [29, Chp. A]), implying that all states have the same value. For $l$ and $u$, we prove correctness inductively. The initial values are trivially correct. The updates based on the mean payoff computation are correct by the above arguments and the induction hypothesis: the maximum of two correct lower bounds still is a lower bound, and analogously for the upper bound. The updates based on the bounds are correct since $\pi^\infty_R$ is a distribution and $l'$, $u'$ are correct bounds. ⊓⊔

We deliberately omit introducing an explicit precision requirement in the algorithm, since we will use it as a building block later on.

*Remark 1.* A variant of this approach also allows for memory savings: By handling one state at a time, we only need to store linearly many additional values (in the number of states) at any time, while an explicit equation system may require quadratic space. This only yields a constant-factor improvement if the system is represented explicitly (storing *δ* requires as much space), but can be of significant merit for symbolically encoded systems. Note that this comes at a cost: As we cannot stop and resume the computation for different states, we have to determine the correct result up to the required precision immediately.

#### **3.2 Reachability and Guided Sampling**

As mentioned before, the second challenge in obtaining a stationary distribution is the reachability probability of each BSCC. We employ a sampling-based approach using insights from [7]. There, the authors considered a single reachability objective, i.e. a single value per state. In contrast, we need to bound reachability probabilities for each BSCC. For now, suppose that all BSCCs are already discovered and their respective stationary distributions are already computed (or approximated). In other words, we have for each BSCC $R \in \mathrm{BSCC}(\mathsf{M})$ bounds $l^R, u^R : R \to [0, 1]$ with $l^R(s) \le \pi^\infty_R(s) \le u^R(s)$, and we want to obtain bounds on the stationary distribution, i.e. functions $l$, $u$ such that $l(s) \le \pi^\infty_{\mathsf{M},\hat{s}}(s) \le u(s)$. We propose to additionally compute bounds on the probability to reach each BSCC $R$, i.e. functions $l^{\Diamond R}$ and $u^{\Diamond R}$ such that $l^{\Diamond R}(s) \le \Pr_{\mathsf{M},s}[\Diamond R] \le u^{\Diamond R}(s)$. By Eq. (3), we then have for each state $s$ a bound on the stationary distribution

$$\sum_{R\in\mathrm{BSCC}(\mathsf{M})} l^{\Diamond R}(\hat{s}) \cdot l^{R}(s) \le \pi^\infty_{\mathsf{M},\hat{s}}(s) \le \sum_{R\in\mathrm{BSCC}(\mathsf{M})} u^{\Diamond R}(\hat{s}) \cdot u^{R}(s).$$

We take a route similar to [7]. There, the algorithm essentially samples a path through the system, possibly guided by a heuristic, terminates the sampling based on several criteria, and then propagates the reachability value backwards along the path, repeating until termination. We propose a simple modification, namely to sample until a BSCC is reached, and then propagate the reachability

#### **Algorithm 2** Approximate BSCC Reachability

**Input:** Markov chain $\mathsf{M} = (S, \delta)$
**Output:** For each BSCC $R$, bounds $l^{\Diamond R}, u^{\Diamond R}$ on the probability to reach $R$.
1: $B \leftarrow \bigcup_{R\in\mathrm{BSCC}(\mathsf{M})} R$, $n \leftarrow 1$
2: **for** $R \in \mathrm{BSCC}(\mathsf{M})$ **do**
3: &nbsp;&nbsp; **for** $s \in R$ **do** $l^{\Diamond R}_1(s) \leftarrow 1$, $u^{\Diamond R}_1(s) \leftarrow 1$
4: &nbsp;&nbsp; **for** $s \in B \setminus R$ **do** $l^{\Diamond R}_1(s) \leftarrow 0$, $u^{\Diamond R}_1(s) \leftarrow 0$
5: &nbsp;&nbsp; **for** $s \in S \setminus B$ **do** $l^{\Diamond R}_1(s) \leftarrow 0$, $u^{\Diamond R}_1(s) \leftarrow 1$
6: **while** ShouldSample **do** ▷ *Sample until some stopping criterion*
7: &nbsp;&nbsp; $P \leftarrow$ SampleStates ▷ *Select states to update (e.g. sample a path)*
8: &nbsp;&nbsp; **for** $R \in$ SelectUpdate($P$) **do** ▷ *Select BSCCs to update*
9: &nbsp;&nbsp;&nbsp;&nbsp; **for** $s \in P$ **do**
10: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $l^{\Diamond R}_{n+1}(s) \leftarrow \delta(s)\langle l^{\Diamond R}_n\rangle$
11: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $u^{\Diamond R}_{n+1}(s) \leftarrow \delta(s)\langle u^{\Diamond R}_n\rangle$
12: &nbsp;&nbsp; **for** $s \in S$ **do** ▷ *Update bounds based on current results (optional)*
13: &nbsp;&nbsp;&nbsp;&nbsp; **for** $R \in \mathrm{BSCC}(\mathsf{M})$ **do**
14: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $l^{\Diamond R}_{n+1}(s) \leftarrow \max\big(l^{\Diamond R}_n(s), 1 - \sum_{R'\in\mathrm{BSCC}(\mathsf{M}), R'\neq R} u^{\Diamond R'}_n(s)\big)$
15: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $u^{\Diamond R}_{n+1}(s) \leftarrow \min\big(u^{\Diamond R}_n(s), 1 - \sum_{R'\in\mathrm{BSCC}(\mathsf{M}), R'\neq R} l^{\Diamond R'}_n(s)\big)$
16: &nbsp;&nbsp; $n \leftarrow n + 1$ and copy unchanged values from $l^{\Diamond R}_n$ and $u^{\Diamond R}_n$ to $l^{\Diamond R}_{n+1}$ and $u^{\Diamond R}_{n+1}$
17: **return** $\{(l^{\Diamond R}, u^{\Diamond R}) \mid R \in \mathrm{BSCC}(\mathsf{M})\}$

values of that particular BSCC back along the path. Moreover, we can employ a similar trick as above: Due to Lemma 1, the reachability probabilities of BSCCs sum up to one, i.e. $\sum_{R\in\mathrm{BSCC}(\mathsf{M})} \Pr_{\mathsf{M},s}[\Diamond R] = 1$ for every state $s$. Hence, the sum of lower bounds also yields upper bounds for other BSCCs, even those we have not encountered so far.

Our ideas are summarized in Algorithm 2. As before, the algorithm leaves several choices open. Instead of requiring to sample a path, our algorithm allows selecting an arbitrary set of states to update. We note that the exact choice of this sampling mechanism does not improve the worst-case runtime. However, as first observed in [7], specially crafted *guidance heuristics* can achieve dramatic practical speed-ups on several models. Later on, we combine our two algorithms and derive such a heuristic. For now, we briefly prove correctness.
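For concreteness, here is a sketch instantiating Algorithm 2's open choices with path sampling along the transition dynamics; the encoding, the pre-computed BSCC list, the episode budget, and the omission of the optional tightening step are our own simplifications.

```python
# Sketch of Algorithm 2 with concrete sub-procedures: SampleStates follows
# the transition dynamics from s_hat until a (pre-computed) BSCC is hit, and
# bounds for every BSCC are back-propagated along the sampled path.

import random

def bscc_reach_bounds(delta, bscc_list, s_hat, episodes=2000,
                      rng=random.Random(0)):
    in_bscc = {s: i for i, R in enumerate(bscc_list) for s in R}
    B = set(in_bscc)
    # initial bounds: 1/1 inside R, 0/0 in other BSCCs, 0/1 elsewhere
    l = [{s: 1.0 if in_bscc.get(s) == i else 0.0 for s in delta}
         for i in range(len(bscc_list))]
    u = [{s: 1.0 if s not in B or in_bscc[s] == i else 0.0 for s in delta}
         for i in range(len(bscc_list))]
    for _ in range(episodes):
        path, s = [], s_hat
        while s not in B:                  # sample until some BSCC is reached
            path.append(s)
            ts = list(delta[s])
            s = rng.choices(ts, weights=[delta[s][t] for t in ts])[0]
        for s in reversed(path):           # back-propagate along the path
            for i in range(len(bscc_list)):
                l[i][s] = sum(p * l[i][t] for t, p in delta[s].items())
                u[i][s] = sum(p * u[i][t] for t, p in delta[s].items())
    return l, u

delta = {"s": {"p": 0.5, "q1": 0.5}, "p": {"p": 1.0},
         "q1": {"q2": 1.0}, "q2": {"q1": 0.2, "q2": 0.8}}
l, u = bscc_reach_bounds(delta, [{"p"}, {"q1", "q2"}], "s")
print(l[0]["s"], u[0]["s"])    # both 0.5: s reaches {p} with probability 1/2
```

On this toy chain a single episode already closes the bounds, since every successor of the initial state lies in a BSCC; on deeper models the bounds tighten gradually as paths are sampled.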

**Theorem 2.** *The result returned by Algorithm 2 is correct for any MC* M = (*S, δ*)*.*

*Proof (Sketch).* Similar to the previous algorithm, we prove correctness by induction. The initial values for $l^{\Diamond R}$ and $u^{\Diamond R}$ are correct. Then, assume that $l^{\Diamond R}_n$ and $u^{\Diamond R}_n$ are correct bounds. The correctness of the back-propagation updates follows directly by inserting into Eq. (1) (or other works on interval value iteration [7,14]). Updates based on the bounds in other states are correct by Lemma 1 – the sum of all BSCC reachability probabilities is 1. Together, this yields correctness of the bounds computed by the algorithm. ⊓⊔

To obtain termination, it is sufficient to require that every state eventually is selected "arbitrarily often" by SampleStates. However, as before, we delegate the termination proof to our combined algorithm in the following section.

## **4 Dynamic Computation with Partial Exploration**

Recall that our overarching goal is to approximate the stationary distribution through Eq. (3). In the previous section, we have seen (i) how we can obtain approximations for a given BSCC and (ii) how to approximate the reachability probabilities of all BSCCs through sampling. However, the naive combination of these algorithms would require us to compute the set of all BSCCs, approximate the stationary distribution in each of them up to a fixed precision, and additionally approximate reachability for each of them.

We now combine both ideas to obtain a sampling-based algorithm, capable of partial exploration, that focusses computation on the relevant parts of the system. In particular, we construct the system dynamically, identify BSCCs on the fly, and interleave the exploration with both the approximation inside each explored BSCC (Algorithm 1) and the overall reachability computation (Algorithm 2). Moreover, we focus computation on BSCCs which are likely to be reached and thus have a higher impact on the overall error of the result. Together, our approach roughly performs the following steps until the required precision is achieved:


We first formalize a generic framework which can instantiate the classical, precise approach as well as our approximation building blocks, and then explain our concrete variant of this framework to efficiently obtain *ε*-precise bounds.

#### **4.1 The Framework**

Since our goal is to allow for both precise and approximate solutions, we phrase the framework using lower and upper bounds together with abstract refinement procedures. We first explain our algorithm and how it generalizes the classical approach. Then, we prove its correctness under general assumptions. Finally, we discuss several approximate variants.

Algorithm 3 essentially repeats three steps until the termination condition in Line 4 is satisfied. First, we update the set of known BSCCs through UpdateBSCCs. In the classical solution, this function simply computes BSCC(M) once; our on-the-fly construction would repeatedly check for newly discovered BSCCs, dynamically growing the set $\mathcal{B}_n$. Then, we select BSCCs for which we should update the stationary distribution bounds. The classical solution solves the fixed-point equation we have discussed in Section 2 for all BSCCs, i.e. SelectDistributionUpdates yields BSCC(M) and RefineDistribution the precisely computed values as both upper and lower bounds. Alternatively, we could, for example, select a single BSCC and apply a few iterations of Algorithm 1. Next, we update reachability bounds for a selected set of BSCCs. Again, the classical solution solves the reachability problem precisely for each BSCC through Eq. (1). Instead, we could employ value iteration as suggested by Algorithm 2.

#### **Algorithm 3** Stationary Distribution Computation Framework

**Input:** Markov chain $\mathsf{M} = (S, \delta)$, initial state $\hat{s}$, precision $\varepsilon > 0$
**Output:** $\varepsilon$-precise bounds $l, u$ on the stationary distribution $\pi^\infty_{\mathsf{M},\hat{s}}$
1: **for** $s \in S$ **do** ▷ *Initial bounds for all possible BSCCs that can be discovered*
2: &nbsp;&nbsp; $l^{\Diamond\circ}_1(s) \leftarrow 0$, $u^{\Diamond\circ}_1(s) \leftarrow 1$, $l^{\circ}_1(s) \leftarrow 0$, $u^{\circ}_1(s) \leftarrow 1$
3: $n \leftarrow 1$, $\mathcal{B}_1 \leftarrow \emptyset$
4: **while** $1 - \sum_{R\in\mathcal{B}_n} l^{\Diamond R}_n(\hat{s}) + \sum_{R\in\mathcal{B}_n} l^{\Diamond R}_n(\hat{s}) \cdot \max_{s\in S}\big(u^R_n(s) - l^R_n(s)\big) > \varepsilon$ **do**
5: &nbsp;&nbsp; $n \leftarrow n + 1$
6: &nbsp;&nbsp; $\mathcal{B}_n \leftarrow$ UpdateBSCCs, $B_n \leftarrow \bigcup_{R\in\mathcal{B}_n} R$ ▷ *Discover new BSCCs*
7: &nbsp;&nbsp; **for** $R \in \mathcal{B}_n \setminus \mathcal{B}_{n-1}$, $s \in R$ **do** ▷ *Update trivial reach bounds*
8: &nbsp;&nbsp;&nbsp;&nbsp; $l^{\Diamond R}_n(s) \leftarrow 1$ ▷ $s \in R$ *surely reaches* $R$
9: &nbsp;&nbsp;&nbsp;&nbsp; **for** $\circ \neq R$ **do** $u^{\Diamond\circ}_n(s) \leftarrow 0$ ▷ $s \in R$ *reaches no other BSCC*
10: &nbsp;&nbsp; **for** $R \in$ SelectDistributionUpdates($\mathcal{B}_n$) $\cap\ \mathcal{B}_n$ **do**
11: &nbsp;&nbsp;&nbsp;&nbsp; $(l^R_n, u^R_n) \leftarrow$ RefineDistribution($R$) ▷ *Update BSCC bounds*
12: &nbsp;&nbsp; **for** $R \in$ SelectReachUpdates($\mathcal{B}_n$) $\cap\ \mathcal{B}_n$ **do**
13: &nbsp;&nbsp;&nbsp;&nbsp; $(l^{\Diamond R}_n, u^{\Diamond R}_n) \leftarrow$ RefineReach($R$) ▷ *Update reachability bounds*
14: &nbsp;&nbsp; Copy unchanged variables from $n - 1$ to $n$
15: $L \leftarrow \sum_{R\in\mathcal{B}_n} l^{\Diamond R}_n(\hat{s})$
16: **for** $R \in \mathcal{B}_n$, $s \in R$ **do**
17: &nbsp;&nbsp; $l(s) \leftarrow l^{\Diamond R}_n(\hat{s}) \cdot l^R_n(s)$
18: &nbsp;&nbsp; $u(s) \leftarrow \min\big(u^{\Diamond R}_n(\hat{s}), 1 - L + l^{\Diamond R}_n(\hat{s})\big) \cdot u^R_n(s)$
19: **for** $s \in S \setminus B_n$ **do** $l(s) \leftarrow 0$, $u(s) \leftarrow 0$
20: **return** $(l, u)$

Before we present our variant, we prove correctness under weak assumptions. We note a subtlety of the termination condition: One may assume that upper bounds on the reachability are required to bound the overall error caused by each BSCC. Yet, as we show in the following theorem, *lower* bounds are sufficient. The upper bound is implicitly handled by the first part of the termination condition.

**Theorem 3.** *The result returned by Algorithm 3 is correct, i.e. gives ε-precise bounds on the stationary distribution, if (i)* $\mathcal{B}_n \subseteq \mathcal{B}_{n+1} \subseteq \mathrm{BSCC}(\mathsf{M})$ *for all* $n$*, and (ii)* RefineDistribution *and* RefineReach *yield correct, monotone bounds.*

The proof can be found in [26, App. B.1].

*Remark 2.* Technically, the algorithm does not need to track explicit upper bounds on the reachability of each BSCC at all. Indeed, for a BSCC $R \in \mathcal{B}_n$, we could use $1 - \sum_{R'\in\mathrm{BSCC}(\mathsf{M})\setminus\{R\}} l^{\Diamond R'}_n(s)$ as upper bound and still obtain a correct algorithm. However, tracking a separate upper bound is easier to understand and has some practical benefits for the implementation.

We exclude a proof of termination, since this strongly depends on the interplay between the functions left open. We provide a general, technical criterion together with a proof in [26, App. B.2]. Intuitively, as one might expect, we require that eventually UpdateBSCCs identifies all relevant BSCCs, SelectDistributionUpdates and SelectReachUpdates select all relevant BSCCs, and RefineDistribution and RefineReach converge to the respective true values. In the following, we present a concrete template which satisfies this criterion.

#### **4.2 Sampling-Based Computation**

We present our instantiation of Algorithm 3 using guided sampling and heuristics. Since the details of the sampling guidance heuristic are rather technical, we focus on how the template functions UpdateBSCCs, SelectDistributionUpdates, RefineDistribution, SelectReachUpdates, and RefineReach are instantiated. For now, the reader may assume that states are, e.g., selected by sampling random paths through the system.


We prove that this yields correct results and terminates with probability 1 through Theorem 3. Note that this description leaves the exact details of the sampling open. Thus, we prove termination using (weak) conditions on the sampling mechanism. For readability, we define the shorthand $\mathrm{err}^R_n = \max_{s\in R} u^R_n(s) - l^R_n(s)$, denoting the overall error of the stationary distribution in BSCC $R$, and $\mathrm{err}^{\Diamond R}_n(s) = u^{\Diamond R}_n(s) - l^{\Diamond R}_n(s)$, the error bound on the reachability of $R$ from $s$.

**Theorem 4.** *Algorithm 3 instantiated with our sampling-based approach yields correct results and terminates with probability 1 if, with probability 1,*


*where "arbitrarily often" means that if the algorithm did not terminate, this would happen infinitely often.*

The proof can be found in [26, App. B.3].

Due to space constraints, we omit an in-depth description of our sampling method and only provide a brief summary here. Our algorithm first selects a "sampling target", which is either "the unknown", i.e. states not seen so far, to encourage exploration in the style of [18], or a known BSCC, to bias sampling towards it. We select a choice randomly, weighted by its current potential influence on the precision. The sampling process is guided by the chosen target, taking actions which lead to the respective target with high probability. In technical terms, we sample successors weighted by the upper

bound on reachability probability times the transition probability. Once the target is reached, we either explore the unknown, or improve precision in the reached BSCC. Finally, information is back-propagated along the path. Further details, in particular pitfalls we encountered during the design process, together with a complete instantiation of our algorithm can be found in [26, App. C].

## **5 Experimental Evaluation**

In this section, we evaluate our approaches, comparing to both our own reference implementation using classical methods as well as the established model checker PRISM [21]. (The other popular model checkers Storm [10] and IscasMC/ePMC [15] do not directly support computing stationary distributions.) We implemented our methods in Java based on PET [24], running on consumer hardware (AMD Ryzen 5 3600). To solve arising linear equation systems, we use Jeigen v1.2. All executions are performed in a Docker container, restricted to a single CPU core and 8 GB of RAM. For approximations, we require a precision of $\varepsilon = 10^{-4}$.

*Tools* Aside from PRISM<sup>1</sup>, we consider three variants of Algorithm 3, namely Classic, the classical approach, solving each BSCC through a linear equation system and then approximating the reachability through PRISM (using interval iteration); Naive, the naive sampling approach, following the transition dynamics; and Sample, our sampling approach, selecting a target and steering towards it. The source code of our implementation used to run these experiments, as well as all models and our data, is available at [25]. Moreover, the current version can be found at GitHub [23].

We mention two points relevant for the comparison. First, as we show in the following, PRISM may yield wrong results due to a (too) simple computation. As such, we should not expect that our correct methods are on par or even faster. Second, our implementation employs conservative procedures to further increase the quality of the result, such as compensated summation to mitigate numerical error due to floating-point imprecision, noticeably increasing computational effort.

*Models* We consider the PRISM benchmark suite<sup>2</sup> [22], comprising several probabilistic models, in particular DTMC, CTMC, and MDP. Since there are not too many Markov chains in this set, we obtain further models as follows. For each CTMC, we consider the *uniformized CTMC* (which preserves the steady-state distribution), and for each MDP we choose actions uniformly at random. Unfortunately, *all* models obtained this way either comprise only single-state BSCCs or the whole model is a single BSCC. In the former case, our approximation within the BSCC is not used at all; in the latter, a sampling-based approach needs to invest additional time to discover the whole system. In order to better compare the performance of our mean payoff based approximation approach, in these cases

<sup>1</sup> We observed that the default hybrid engine typically is significantly slower than the "explicit" variant and thus use the latter, see [26, App. D].

<sup>2</sup> Obtained from https://github.com/prismmodelchecker/prism-benchmarks.


**Fig. 2.** A small MC where PRISM reports wrong results for *e* ≤ 10<sup>−7</sup>.

we pre-explore the whole system and compute the stationary distribution directly through Algorithm 1. To compare the combined performance, we additionally consider a handcrafted model, named **branch**, which comprises both transient states and several non-trivial BSCCs.
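Uniformization, used above to turn CTMCs into steady-state-equivalent DTMCs, can be sketched as follows (a hedged illustration with a hypothetical two-state generator; `uniformize` is not a tool API):

```python
import numpy as np

def uniformize(Q, rate=None):
    """Turn a CTMC generator Q into the DTMC matrix P = I + Q/rate.
    For rate >= the maximal exit rate, P is stochastic and shares the
    stationary distribution of Q (pi Q = 0  iff  pi P = pi)."""
    if rate is None:
        rate = max(-Q.diagonal())  # maximal exit rate
    return np.eye(Q.shape[0]) + Q / rate

# Hypothetical generator: leave state 0 at rate 3, state 1 at rate 1.
Q = np.array([[-3.0,  3.0],
              [ 1.0, -1.0]])
P = uniformize(Q)

# Stationary distribution of P via the power method:
pi = np.array([0.5, 0.5])
for _ in range(200):
    pi = pi @ P
# pi satisfies pi Q = 0, i.e. the CTMC steady state is preserved
```

Here the fixed point is (1/4, 3/4), which is also the steady state of the original generator.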

We present selected results, highlighting different strengths and weaknesses of each approach. An evaluation on the complete suite can be found in [26, App. D].

*Correctness* We discovered that PRISM potentially yields wrong results, due to an unsafe stopping criterion. In particular, PRISM iterates the power method until the absolute difference between subsequent iterates is small, exactly as in its "unsafe" value iteration for reachability, as reported by e.g. [7]. On the model from Fig. 2, PRISM (with the explicit engine) immediately terminates, printing a result of ≈ (1/6, 1/6, 1/3, 1/3). However, the correct stationary distribution is ≈ (1/9, 2/9, 4/9, 2/9) (from left to right), which both of our methods correctly identify. This behaviour is due to the small difference between the first and second eigenvalue of the transition matrix, which in turn implies that the iterates of the power method change only by a small amount. We note that on this example, PRISM's default hybrid engine eventually yields the correct result (after ≈ 10<sup>8</sup> iterations) due to the iteration scheme it uses. On a small variation of the model (included in the artefact), it also terminates immediately with the wrong result.
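The failure mode of the unsafe stopping criterion can be reproduced on a hypothetical slowly mixing two-state chain (not the exact chain of Fig. 2, which is larger):

```python
import numpy as np

e = 1e-8  # tiny transition probability, the regime of Fig. 2
# Hypothetical chain whose true stationary distribution is (2/3, 1/3):
P = np.array([[1 - e,      e    ],
              [2 * e, 1 - 2 * e]])

# Unsafe criterion: stop once successive power-method iterates are close.
pi = np.array([0.5, 0.5])
while True:
    nxt = pi @ P
    if np.max(np.abs(nxt - pi)) < 1e-6:  # absolute-difference stopping rule
        pi = nxt
        break
    pi = nxt
# pi is still close to (0.5, 0.5): the iterates move by only e/2 per step,
# so the criterion fires long before convergence.

# Correct answer via the balance equations (pi P = pi, sum pi = 1):
A = np.vstack([(P.T - np.eye(2))[:-1], np.ones(2)])
exact = np.linalg.solve(A, np.array([0.0, 1.0]))
```

The iteration stops after a single step at roughly the uniform distribution, while the linear-system solution yields (2/3, 1/3).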

*Results* We summarize our results in Table 1. We observe several points. First, we see that the naive sampling approach can hardly handle non-trivial models. Second, our guided sampling approach achieves significant improvements on several models over both the classical, correct method and the potentially unsound approach of PRISM, in particular when hardly reachable portions of the state space can be completely discarded. However, on other models the classical approach seems more appropriate, in particular on models with many BSCCs that are likely to be reached. Here, the sampling approach struggles to propagate the reachability bounds of all BSCCs simultaneously. Finally, as suggested by the **phil** and **rabin** models, mean payoff based approximation can significantly outperform classical equation solving. In summary, PRISM, Classic, and Sample can each be the fastest method, depending on the structure of the model. However, recall that PRISM's method does not give guarantees on the result.

*Further Discussion* As expected, we observed that the runtime of approximation can increase drastically for smaller precision requirements (e.g. *ε* = 10<sup>−8</sup>), and solving the equation system precisely may actually be faster for some BSCCs. However, especially in the combined approach, if we already have some upper bound on the reachability probability of a certain BSCC, we do not need to solve it with the original precision. Hence, a future version of the implementation could

**Table 1.** Overview of our results. For each model, we list its parameters, overall size, and number of BSCCs, followed by the total execution time in seconds for each tool; TO denotes a timeout (300 seconds), MO a memout, and err an internal error. On systems comprising a single BSCC, the Naive and Sample approaches coincide.


dynamically decide whether to solve a BSCC by mean payoff approximation or by equation solving, combining the advantages of both worlds.

Secondly, this also highlights an interesting trade-off implicit to our approach: the algorithm needs to balance between exploring unknown areas and refining bounds on known BSCCs, in particular since exploring a new BSCC adds noticeable effort, namely one more target for which the reachability has to be determined. Here, more sophisticated heuristics could be useful.

Finally, for models with large BSCCs, such as **rabin**, we also observed that the classical linear equation approach indeed runs out of memory while a variant of the approximation algorithm can still solve it, as indicated by Remark 1. Thus, the implementation could moreover take memory constraints into account, deciding to apply the memory-saving approach in appropriate cases.

## **6 Conclusion**

We presented a new perspective on computing the stationary distribution in Markov chains by rephrasing the problem in terms of mean payoff and reachability. We combined several recent advances for these problems to obtain a sophisticated partial-exploration based algorithm. Our evaluation shows that on several models our new approach is significantly more performant. As a major technical contribution, we provided a general algorithmic framework, which encompasses both the classical solution approach and our new method.

As hinted by the discussion above, our framework is quite flexible. For future work, we particularly want to identify better guidance heuristics. Specifically, based on experimental data, we conjecture that the reachability part can be improved significantly. Moreover, due to the flexibility of our framework, we can apply different methods for each BSCC to obtain the reachability and stationary distribution. Thus, we want to find meta-heuristics which suggest the most appropriate method in each case. For example, for smaller BSCCs, we could use the classical, precise solution method to obtain the stationary distribution, while for larger ones we employ our mean payoff approach, and, in the spirit of Remark 1, for even larger ones we approximate them to the required precision immediately, saving memory. Additionally, we could identify BSCCs that satisfy the conditions of specialized approaches such as [11].

## **References**


July 8-10, 2013. Proceedings. Lecture Notes in Computer Science, vol. 7984, pp. 380–395. Springer (2013). https://doi.org/10.1007/978-3-642-39408-9_27


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4. 0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## Robust Almost-Sure Reachability in Multi-Environment MDPs

Marck van der Vegt, Nils Jansen, and Sebastian Junges

Radboud University, Nijmegen, The Netherlands {marck.vandervegt,nils.jansen,sebastian.junges}@ru.nl

Abstract. Multiple-environment MDPs (MEMDPs) capture finite sets of MDPs that share the states but differ in the transition dynamics. These models form a proper subclass of partially observable MDPs (POMDPs). We consider the synthesis of policies that robustly satisfy an almost-sure reachability property in MEMDPs, that is, one policy that satisfies a property for all environments. For POMDPs, deciding the existence of robust policies is an EXPTIME-complete problem. We show that this problem is PSPACE-complete for MEMDPs, while the policies require exponential memory in general. We exploit the theoretical results to develop and implement an algorithm that shows promising results in synthesizing robust policies for various benchmarks.

## 1 Introduction

Markov decision processes (MDPs) are the standard formalism to model sequential decision making under uncertainty. A typical goal is to find a policy that satisfies a temporal logic specification [5]. Probabilistic model checkers such as Storm [22] and Prism [30] efficiently compute such policies. A concern, however, is the robustness against potential perturbations in the environment. MDPs cannot capture such uncertainty about the shape of the environment.

Multi-environment MDPs (MEMDPs) [36,14] contain a set of MDPs, called environments, over the same state space. The goal in MEMDPs is to find a single policy that satisfies a given specification in all environments. MEMDPs are, for instance, a natural model for MDPs with unknown system dynamics, where several domain experts provide their interpretation of the dynamics [11]. These different MDPs together form a MEMDP. MEMDPs also arise in other domains: The guessing of a (static) password is a natural example in security. In robotics, a MEMDP captures unknown positions of some static obstacle. One can interpret MEMDPs as a (disjoint) union of MDPs in which an agent only has partial observation, i.e., every MEMDP can be cast into a linearly larger partially observable MDP (POMDP) [27]. Indeed, some famous examples for POMDPs are in fact MEMDPs, such as RockSample [39] and Hallway [31]. Solving POMDPs is notoriously hard [32], and thus, it is worthwhile to investigate natural subclasses.

We consider almost-sure specifications where the probability needs to be one to reach a set of target states. In MDPs, it suffices to consider memoryless policies. Constructing such policies can be efficiently implemented by means of a graph-search [5]. For MEMDPs, we consider the following problem:

Compute one policy that almost-surely reaches the target in all environments.

Such a policy robustly satisfies an almost-sure specification for a set of MDPs.

Our approach. Inspired by work on POMDPs, we construct a belief-observation MDP (BOMDP) [16] that tracks the states of the MDPs and the (support of the) belief over potential environments. We show that a policy satisfying the almost-sure property in the BOMDP also satisfies the property in the MEMDP.

Although the BOMDP is exponentially larger than the MEMDP, we exploit its particular structure to create a PSPACE algorithm to decide whether such a robust policy exists. The essence of the algorithm is a recursive construction of a fragment of the BOMDP, restricted to a setting in which the belief-support is fixed. Such an approach is possible, as the belief in a MEMDP behaves monotonically: Once we know that we are not in a particular environment, we never lose this knowledge. This behavior is in contrast to POMDPs, where there is no monotonic behavior in belief-supports. The difference is essential: Deciding almost-sure reachability in POMDPs is EXPTIME-complete [37,19]. In contrast, the problem of deciding whether a policy for almost-sure reachability in a MEMDP exists is indeed PSPACE-complete. We show the hardness using a reduction from the true quantified Boolean formula problem. Finally, we cannot hope to extract a policy with such an algorithm, as the smallest policy for MEMDPs may require exponential memory in the number of environments.

The PSPACE algorithm itself recomputes many results. For practical purposes, we create an algorithm that iteratively explores parts of the BOMDP. The algorithm additionally uses the MEMDP structure to generalize the set of states from which a winning policy exists and deduce efficient heuristics for guiding the exploration. The combination of these ingredients leads to an efficient and competitive prototype on top of the model checker Storm.

Related work. We categorize related work in three areas.

MEMDPs. Almost-sure reachability for MEMDPs with exactly two environments has been studied by [36]. We extend the results to arbitrarily many environments. This is nontrivial: for two environments, the decision problem has a polynomial-time routine [36], whereas we show that the problem is PSPACE-complete for an arbitrary number of environments. MEMDPs and closely related models such as hidden-model MDPs, hidden-parameter MDPs, multi-model MDPs, and concurrent MDPs [11,2,40,10] have been considered for quantitative properties<sup>1</sup>. The typical approach is to consider approximative algorithms for the undecidable problem in POMDPs [14] or to adapt reinforcement learning algorithms [3,28]. These approximations are not applicable to almost-sure properties.

POMDPs. One can build an underlying potentially infinite belief-MDP [27] that corresponds to the POMDP – using model checkers [35,7,8] to verify this MDP

<sup>1</sup> Hidden-parameter MDPs differ from MEMDPs in that they assume a prior over MDPs. However, for almost-sure properties, this difference is irrelevant.

can answer the question for MEMDPs. For POMDPs, almost-sure reachability is decidable in exponential time [37,19] via a construction similar to ours. Most qualitative properties beyond almost-sure reachability are undecidable [4,15]. Two dedicated algorithms that limit the search to policies with small memory requirements and employ a SAT-based approach [12,26] to this NP-hard problem [19] are implemented in Storm. We use them as baselines.

Robust models. The high-level representation of MEMDPs is structurally similar to featured MDPs [18,1] that represent sets of MDPs. The proposed techniques are called family-based model checking and compute policies for every MDP in the family, whereas we aim to find one policy for all MDPs. Interval MDPs [25,43,23] and SGs [38] do not allow for dependencies between states and thus cannot model features such as various obstacle positions. Parametric MDPs [2,44,24] assume controllable uncertainty and do not consider robustness of policies.

Contributions. We establish PSPACE-completeness for deciding almost-sure reachability in MEMDPs and show that the policies may be exponentially large. Our iterative algorithm, which is the first specific to almost-sure reachability in MEMDPs, builds fragments of the BOMDP. An empirical evaluation shows that the iterative algorithm outperforms approaches dedicated to POMDPs.

## 2 Problem Statement

In this section, we provide some background and formalize the problem statement.

For a set X, Dist(X) denotes the set of probability distributions over X. For a distribution d ∈ Dist(X), we denote its support by Supp(d). For a finite set X, unif(X) denotes the uniform distribution, and dirac(x) denotes the Dirac distribution on x ∈ X. We use shorthand notation for functions and distributions: f = [x ↦ a, y ↦ b] means that f(x) = a and f(y) = b. We write P(X) for the powerset of X. For n ∈ N, we write [n] = {i ∈ N | 1 ≤ i ≤ n}.

Definition 1 (MDP). A Markov Decision Process is a tuple M = ⟨S, A, ι<sub>init</sub>, p⟩ where S is the finite set of states, A is the finite set of actions, ι<sub>init</sub> ∈ Dist(S) is the initial state distribution, and p: S × A → Dist(S) is the transition function.

The transition function is total; that is, for notational convenience, MDPs are input-enabled. This requirement does not affect the generality of our results. A path of an MDP is a sequence π = s<sub>0</sub>a<sub>0</sub>s<sub>1</sub>a<sub>1</sub>…s<sub>n</sub> such that ι<sub>init</sub>(s<sub>0</sub>) > 0 and p(s<sub>i</sub>, a<sub>i</sub>)(s<sub>i+1</sub>) > 0 for all 0 ≤ i < n. The last state of π is last(π) = s<sub>n</sub>. The set of all finite paths is Path, and Path(S′) denotes the paths starting in a state from S′ ⊆ S. The set of states reachable from S′ is Reachable(S′). If S′ = Supp(ι<sub>init</sub>), we just call them the reachable states. The MDP restricted to the states reachable from a distribution d ∈ Dist(S) is ReachFragment(M, d), where d is the new initial distribution. A state s ∈ S is absorbing if Reachable({s}) = {s}. An MDP is acyclic if each state is absorbing or not reachable from its successor states.
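The notions Reachable(S′) and absorbing states amount to a plain graph search. A minimal sketch, assuming an adjacency map `succ` (successors under any action with positive probability) rather than the paper's data structures:

```python
from collections import deque

def reachable(succ, start):
    """Reachable(S'): all states reachable from `start` (inclusive),
    where succ[s] is the set of successors of s."""
    seen, todo = set(start), deque(start)
    while todo:
        s = todo.popleft()
        for t in succ.get(s, ()):
            if t not in seen:
                seen.add(t)
                todo.append(t)
    return seen

def absorbing(succ, s):
    """A state s is absorbing iff Reachable({s}) = {s}."""
    return reachable(succ, {s}) == {s}

# Hypothetical 4-state MDP skeleton:
succ = {0: {0, 1}, 1: {2}, 2: {2}, 3: {0}}
# reachable(succ, {1}) == {1, 2}; state 2 is absorbing, state 0 is not
```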

Action choices are resolved by a policy σ : Path → Dist(A) that maps paths to distributions over actions. A policy of the form σ : S → Dist(A) is

Fig. 1: Example MEMDP

called memoryless; deterministic if we have σ: Path → A; and memoryless deterministic for σ: S → A. For an MDP M, we denote the probability of a policy σ reaching some target set T ⊆ S starting in state s as Pr<sub>M</sub>(s → T | σ). More precisely, Pr<sub>M</sub>(s → T | σ) denotes the probability of all paths from s reaching T under σ. We use Pr<sub>M</sub>(T | σ) if s is distributed according to ι<sub>init</sub>.

Definition 2 (MEMDP). A Multiple Environment MDP is a tuple N = ⟨S, A, ι<sub>init</sub>, {p<sub>i</sub>}<sub>i∈I</sub>⟩ with S, A, ι<sub>init</sub> as for MDPs, and {p<sub>i</sub>}<sub>i∈I</sub> a set of transition functions, where I is a finite set of environment indices.

Intuitively, MEMDPs form sets of MDPs (environments) that share states and actions, but differ in the transition probabilities. For an MEMDP N with index set I and a set I′ ⊆ I, we define the restriction of environments as the MEMDP N↓<sub>I′</sub> = ⟨S, A, ι<sub>init</sub>, {p<sub>i</sub>}<sub>i∈I′</sub>⟩. Given an environment i ∈ I, we denote its corresponding MDP as N<sub>i</sub> = ⟨S, A, ι<sub>init</sub>, p<sub>i</sub>⟩. A MEMDP with only one environment is an MDP. Paths and policies are defined on the states and actions of MEMDPs and do not differ from MDP policies. A MEMDP is acyclic if each MDP is acyclic.
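Definition 2 and the restriction N↓<sub>I′</sub> can be captured in a few lines. A hedged sketch with assumed dictionary encodings (`MEMDP`, `restrict`, `mdp` are illustrative names, not tool APIs):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MEMDP:
    """N = <S, A, iota_init, {p_i}>: shared states and actions,
    one transition function per environment index i in I."""
    states: frozenset
    actions: frozenset
    init: dict    # state -> probability
    trans: dict   # env index i -> {(s, a): {s': prob}}

    def restrict(self, env_subset):
        """The restriction N|I': keep only the environments in env_subset."""
        return MEMDP(self.states, self.actions, self.init,
                     {i: self.trans[i] for i in env_subset})

    def mdp(self, i):
        """The single-environment MDP N_i."""
        return self.restrict({i})

# Two hypothetical environments that differ on action 'a' in state 0:
p1 = {(0, 'a'): {1: 1.0}, (1, 'a'): {1: 1.0}}
p2 = {(0, 'a'): {0: 1.0}, (1, 'a'): {1: 1.0}}
n = MEMDP(frozenset({0, 1}), frozenset({'a'}), {0: 1.0}, {1: p1, 2: p2})
```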

Example 1. Figure 1 shows an MEMDP with three environments N<sub>i</sub>. An agent can ask two questions, q<sub>1</sub> and q<sub>2</sub>. The response is either 'switch' (s<sub>1</sub> ↔ s<sub>2</sub>) or 'stay' (loop). In N<sub>1</sub>, the response to q<sub>1</sub> and q<sub>2</sub> is to switch. In N<sub>2</sub>, the response to q<sub>1</sub> is stay, and to q<sub>2</sub> is switch. The agent can guess the environment using a<sub>1</sub>, a<sub>2</sub>, a<sub>3</sub>. Guessing a<sub>i</sub> leads to the target only in environment i. Thus, an agent must deduce the environment via q<sub>1</sub>, q<sub>2</sub> to surely reach the target.

Definition 3 (Almost-Sure Reachability). An almost-sure reachability property is defined by a set T ⊆ S of target states. A policy σ satisfies the property T for MEMDP N = ⟨S, A, ι<sub>init</sub>, {p<sub>i</sub>}<sub>i∈I</sub>⟩ iff ∀i ∈ I: Pr<sub>N<sub>i</sub></sub>(T | σ) = 1.

In other words, a policy σ satisfies an almost-sure reachability property T, called winning, if and only if the probability of reaching T within each MDP is one. By extension, a state s ∈ S is winning if there exists a winning policy when starting in state s. Policies and states that are not winning are losing. We will now define both the decision and policy problem:

Given a MEMDP N and an almost-sure reachability property T. The Decision Problem asks to decide if a policy exists that satisfies T. The Policy Problem asks to compute such a policy, if it exists.

In Section 4 we discuss the computational complexity of the decision problem. Following up, in Section 5 we present our algorithm for solving the policy problem. Details on its implementation and evaluation will be presented in Section 6.

## 3 A Reduction To Belief-Observation MDPs

In this section, we reduce the policy problem, and thus also the decision problem, to finding a policy in an exponentially larger belief-observation MDP. This reduction is an elementary building block for the construction of our PSPACE algorithm and the practical implementation. Additional information such as proofs for statements throughout the paper are available in the technical report [41].

## 3.1 Interpretation of MEMDPs as Partially Observable MDPs

Definition 4 (POMDP). A partially observable MDP (POMDP) is a tuple ⟨M, Z, O⟩ with an MDP M = ⟨S, A, ι<sub>init</sub>, p⟩, a set Z of observations, and an observation function O: S → Z.

A POMDP is an MDP where states are labelled with observations. We lift O to paths and use O(π) = O(s<sub>1</sub>)a<sub>1</sub>O(s<sub>2</sub>)…O(s<sub>n</sub>). We use observation-based policies σ, i.e., policies such that for π, π′ ∈ Path, O(π) = O(π′) implies σ(π) = σ(π′). A MEMDP can be cast into a POMDP that is made up as the disjoint union:

Definition 5 (Union-POMDP). Given an MEMDP N = ⟨S, A, ι<sub>init</sub>, {p<sub>i</sub>}<sub>i∈I</sub>⟩, we define its union-POMDP N<sup>⊔</sup> = ⟨⟨S′, A, ι′<sub>init</sub>, p′⟩, Z, O⟩, with states S′ = S × I, initial distribution ι′<sub>init</sub>(⟨s, i⟩) = ι<sub>init</sub>(s) · |I|<sup>−1</sup>, transitions p′(⟨s, i⟩, a)(⟨s′, i⟩) = p<sub>i</sub>(s, a)(s′), observations Z = S, and observation function O(⟨s, i⟩) = s.

A policy may observe the state s but not in which MDP we are. This forces any observation-based policy to take the same choice in all environments.
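The construction of Definition 5 can be sketched directly. A minimal illustration with assumed dictionary encodings (`union_pomdp` is an illustrative name, not a tool API):

```python
def union_pomdp(trans, init, n_envs):
    """Definition 5 as a sketch: states <s, i>, initial weight
    iota_init(s)/|I|, transitions inherited from p_i; the observation
    of <s, i> is just s, hiding the environment index i."""
    init2 = {(s, i): p / n_envs
             for s, p in init.items() for i in range(1, n_envs + 1)}
    trans2 = {((s, i), a): {(t, i): q for t, q in dist.items()}
              for i in range(1, n_envs + 1)
              for (s, a), dist in trans[i].items()}
    obs = lambda state: state[0]  # observe the state, not the environment
    return init2, trans2, obs

# Two hypothetical environments over states {0, 1} and action 'a':
trans = {1: {(0, 'a'): {1: 1.0}}, 2: {(0, 'a'): {0: 1.0}}}
init2, trans2, obs = union_pomdp(trans, {0: 1.0}, n_envs=2)
# init2 == {(0, 1): 0.5, (0, 2): 0.5}; both copies share observation 0
```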

Lemma 1. Given MEMDP N, there exists a winning policy iff there exists an observation-based policy σ such that Pr<sub>N<sup>⊔</sup></sub>(T | σ) = 1.

The statement follows as, first, any observation-based policy of the POMDP can be applied to the MEMDP; second, vice versa, any MEMDP policy is observation-based; and third, the induced MCs under these policies are isomorphic.

## 3.2 Belief-observation MDPs

For POMDPs, memoryless policies are not sufficient, which makes computing policies intricate. We therefore add the information contained in the history, i.e., the path up to some point. In MEMDPs, this information is the (environment-)belief (support) J ⊆ I, i.e., the set of environments that are consistent with a path in the MEMDP. Given a belief J ⊆ I and a state-action-state transition s →<sup>a</sup> s′, we define Up(J, s, a, s′) = {i ∈ J | p<sub>i</sub>(s, a, s′) > 0}, i.e., the subset of environments in which the transition exists. For a path π ∈ Path, we define its corresponding belief B(π) ⊆ I recursively as:

$$\mathcal{B}(s\_0) = I \quad \text{and} \quad \mathcal{B}(\pi \cdot sas') = \mathsf{Up}(\mathcal{B}(\pi \cdot s), s, a, s')$$

The belief in a MEMDP monotonically decreases along a path, i.e., if we know that we are not in a particular environment, this remains true indefinitely.
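The update operator Up and the belief B can be folded over a path in a few lines. A hedged sketch with a hypothetical two-environment MEMDP in the spirit of the question/answer example (`update`, `belief` are illustrative names):

```python
def update(J, trans, s, a, s2):
    """Up(J, s, a, s'): keep the environments whose transition function
    assigns positive probability to s -a-> s'."""
    return frozenset(i for i in J
                     if trans[i].get((s, a), {}).get(s2, 0) > 0)

def belief(path, trans, envs):
    """B(pi): fold Up over a path given as [s0, a0, s1, a1, ..., sn]."""
    J = frozenset(envs)
    for k in range(0, len(path) - 2, 2):
        s, a, s2 = path[k], path[k + 1], path[k + 2]
        J = update(J, trans, s, a, s2)
    return J

# Hypothetical environments: action 'q' switches states in env 1
# and stays put in env 2.
trans = {1: {(0, 'q'): {1: 1.0}}, 2: {(0, 'q'): {0: 1.0}}}
J = belief([0, 'q', 1], trans, {1, 2})  # observing the switch rules out env 2
```

Each step intersects the belief with the surviving environments, so the belief only ever shrinks along a path, mirroring the monotonicity stated above.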

We aim to use a model where memoryless policies suffice. To that end, we cast MEMDPs into the exponentially larger belief-observation MDPs [16]<sup>2</sup>.

Definition 6 (BOMDP). For a MEMDP N = ⟨S, A, ι<sub>init</sub>, {p<sub>i</sub>}<sub>i∈I</sub>⟩, we define its belief-observation MDP (BOMDP) as a POMDP G<sub>N</sub> = ⟨⟨S′, A, ι′<sub>init</sub>, p′⟩, Z, O⟩ with states S′ = S × I × P(I), initial distribution ι′<sub>init</sub>(⟨s, j, I⟩) = ι<sub>init</sub>(s) · |I|<sup>−1</sup>, transition relation p′(⟨s, j, J⟩, a)(⟨s′, j, J′⟩) = p<sub>j</sub>(s, a, s′) with J′ = Up(J, s, a, s′), observations Z = S × P(I), and observation function O(⟨s, j, J⟩) = ⟨s, J⟩.

Compared to the union-POMDP, the BOMDP additionally tracks the belief by updating it accordingly. We clarify the correspondence between paths of the BOMDP and of the MEMDP. A path π through the MEMDP can be mimicked exactly in the MDPs N<sub>j</sub> for j ∈ B(π). As we track B(π) in the state, we can deduce from the BOMDP state in which environments we can be.

Lemma 2. For MEMDP N and the path ⟨s<sub>1</sub>, j, J<sub>1</sub>⟩a<sub>1</sub>⟨s<sub>2</sub>, j, J<sub>2</sub>⟩…⟨s<sub>n</sub>, j, J<sub>n</sub>⟩ of the BOMDP G<sub>N</sub>, let j ∈ J<sub>1</sub>. Then: J<sub>n</sub> ≠ ∅ and the path s<sub>1</sub>a<sub>1</sub>…s<sub>n</sub> exists in MDP N<sub>i</sub> iff i ∈ J<sub>1</sub> ∩ J<sub>n</sub>.

Consequently, the belief of a path can be uniquely determined by the observation of the last state reached, hence the name belief-observation MDPs.

Lemma 3. For every pair of paths π, π<sup>0</sup> in a BOMDP, we have:

> O(last(π)) = O(last(π′)) implies B(π) = B(π′).

For notation, we define S<sub>J</sub> = {⟨s, j, J⟩ | j ∈ J, s ∈ S}, and analogously write Z<sub>J</sub> = {⟨s, J⟩ | s ∈ S}. We lift the target states T to states of the BOMDP: T<sub>G<sub>N</sub></sub> = {⟨s, j, J⟩ | s ∈ T, J ⊆ I, j ∈ J} and define the target observations T<sub>Z</sub> = O(T<sub>G<sub>N</sub></sub>).

Definition 7 (Winning in a BOMDP). Let G<sub>N</sub> be a BOMDP with target observations T<sub>Z</sub>. An observation-based policy σ is winning from some observation z ∈ Z if for all s ∈ O<sup>−1</sup>(z) it holds that Pr<sub>G<sub>N</sub></sub>(s → O<sup>−1</sup>(T<sub>Z</sub>) | σ) = 1.

Furthermore, a policy σ is winning if it is winning for the initial distribution ι<sub>init</sub>. An observation z is winning if there exists a winning policy for z. The winning region Win<sup>T</sup><sub>G<sub>N</sub></sub> is the set of all winning observations.

Almost-sure winning in the BOMDP corresponds to winning in the MEMDP.

Theorem 1. There exists a winning policy for a MEMDP N with target states T iff there exists a winning policy in the BOMDP G<sub>N</sub> with target states T<sub>G<sub>N</sub></sub>.

Intuitively, the important aspect is that for almost-sure reachability, observation-based memoryless policies are sufficient [13]. For any such policy, the induced Markov chains on the union-POMDP and the BOMDP are bisimilar [16].

BOMDPs make policy search conceptually easier. First, as memoryless policies suffice for almost-sure reachability, winning regions are independent of fixed policies: for policies σ and σ′ that are winning in observations z and z′, respectively, there must exist a policy σ̂ that is winning for both z and z′. Second, winning regions can be determined in polynomial time in the size of the BOMDP [16].

<sup>2</sup> This translation is notationally simpler than going via the union-POMDP.

## 3.3 Fragments of BOMDPs

To avoid storing the exponentially sized BOMDP, we only build fragments: We may select any set of observations as frontier observations and make the states with those observations absorbing. We later discuss the selection of frontiers.

Definition 8 (Sliced BOMDP). For a BOMDP G<sub>N</sub> = ⟨⟨S, A, ι<sub>init</sub>, p⟩, Z, O⟩ and a set of frontier observations F ⊆ Z, we define a BOMDP G<sub>N</sub>|<sub>F</sub> = ⟨⟨S, A, ι<sub>init</sub>, p′⟩, Z, O⟩ with:

$$\forall s \in S, a \in A \colon p'(s, a) = \begin{cases} \mathsf{dirac}(s) & \text{if } O(s) \in F, \\ p(s, a) & \text{otherwise}. \end{cases}$$

We exploit this sliced BOMDP to derive constraints on the set of winning states.
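Definition 8 is a purely local transformation of the transition function. A minimal sketch, assuming a dictionary transition map and an observation function (`slice_transitions` is an illustrative name):

```python
def slice_transitions(p, obs, frontier):
    """Definition 8 as a sketch: states whose observation lies in the
    frontier set F become absorbing (Dirac self-loop); all other
    transitions are kept unchanged."""
    return {(s, a): ({s: 1.0} if obs(s) in frontier else dist)
            for (s, a), dist in p.items()}

# Hypothetical two-state system where state 1 carries observation 'f':
p = {(0, 'a'): {1: 1.0}, (1, 'a'): {0: 1.0}}
obs = lambda s: 'f' if s == 1 else 'z'
p_sliced = slice_transitions(p, obs, frontier={'f'})
# state 1 is now absorbing; the transition out of state 0 is untouched
```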

Lemma 4. For every BOMDP G<sub>N</sub> with states S and targets T and for all frontier observations F ⊆ Z it holds that: Win<sup>T</sup><sub>G<sub>N</sub>|<sub>F</sub></sub> ⊆ Win<sup>T</sup><sub>G<sub>N</sub></sub> ⊆ Win<sup>T∪F</sup><sub>G<sub>N</sub>|<sub>F</sub></sub>.

Making (non-target) observations absorbing extends the set of losing observations, while adding target states extends the set of winning observations.

## 4 Computational Complexity

The BOMDP G<sub>N</sub> above yields an exponential time and space algorithm via Theorem 1. We can avoid the exponential memory requirement. This section shows the PSPACE-completeness of deciding whether a winning policy exists.

Theorem 2. The almost-sure reachability decision problem is PSPACE-complete.

The result follows from Lemmas 11 and 10 below. In Section 4.3, we show that representing the winning policy itself may however require exponential space.

## 4.1 Deciding Almost-Sure Winning for MEMDPs in PSPACE

We develop an algorithm with a polynomial memory footprint. The algorithm exploits locality of cyclic behavior in the BOMDP, as formalized by an acyclic environment graph and local BOMDPs that match the nodes in the environment graph. The algorithm recurses on the environment graph while memorizing results from polynomially many local BOMDPs.

The graph-structure of BOMDPs. First, along a path of the MEMDP, we will only gain information and are thus able to rule out certain environments [14]. Due to the monotonicity of the update operator, we have for any BOMDP that ⟨s, j, J⟩ ∈ Reachable(⟨s′, j, J′⟩) implies J ⊆ J′. We define a graph over environment sets that describes how the belief-support can update over a run.

Definition 9 (Environment graph). Let N be a MEMDP and p the transition function of G<sub>N</sub>. The environment graph G<sup>E</sup><sub>N</sub> = (V<sub>N</sub>, E<sub>N</sub>) for N is a directed graph with vertices V<sub>N</sub> = P(I) and edges

$$E\_N = \{ \langle J, J' \rangle \mid \exists s, s' \in S, a \in A, j \in I. p(\langle s, j, J \rangle, a, \langle s', j, J' \rangle) > 0 \text{ and } J \neq J' \}.$$
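The edge relation E<sub>N</sub> can be computed by brute force over belief supports. A hedged sketch, exponential in |I| and intended only to make Definition 9 concrete (illustrative names, not the paper's implementation):

```python
from itertools import product

def update(J, trans, s, a, s2):
    """Up(J, s, a, s') as in Section 3."""
    return frozenset(i for i in J
                     if trans[i].get((s, a), {}).get(s2, 0) > 0)

def _subsets(xs):
    xs = list(xs)
    for mask in range(1 << len(xs)):
        yield {x for k, x in enumerate(xs) if mask >> k & 1}

def env_graph_edges(trans, states, actions, envs):
    """E_N: pairs <J, J'> where some transition s -a-> s', possible in
    an environment j in J, shrinks the belief support from J to J'."""
    supports = [frozenset(J) for J in _subsets(envs) if J]
    edges = set()
    for J, s, a, s2 in product(supports, states, actions, states):
        for j in J:
            if trans[j].get((s, a), {}).get(s2, 0) > 0:
                J2 = update(J, trans, s, a, s2)
                if J2 != J:  # only strictly shrinking edges, J' is never empty
                    edges.add((J, J2))
    return edges

# Two hypothetical environments distinguished by action 'q' in state 0:
trans = {1: {(0, 'q'): {1: 1.0}}, 2: {(0, 'q'): {0: 1.0}}}
edges = env_graph_edges(trans, {0, 1}, {'q'}, {1, 2})
# a single distinguishing action yields edges {1,2} -> {1} and {1,2} -> {2}
```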

Fig. 2: The environment graph for our running example.

Example 2. Figure 2 shows the environment graph for the MEMDP in Ex. 1. It consists of the different belief-supports. For example, the transitions from {1, 2, 3} to {2, 3} and to {1} are due to the action q<sub>1</sub> in state s<sub>0</sub>, as shown in Fig. 1.

Paths in the environment graph abstract paths in the BOMDP. Path fragments where the belief-support remains unchanged are summarized into one step, as we do not create edges of the form ⟨J, J⟩. We formalize this idea: let π = ⟨s<sub>1</sub>, j, J<sub>1</sub>⟩a<sub>1</sub>⟨s<sub>2</sub>, j, J<sub>2</sub>⟩…⟨s<sub>n</sub>, j, J<sub>n</sub>⟩ be a path in the BOMDP. For any J ⊆ I, we call π a J-local path if J<sub>i</sub> = J for all i ∈ [n].

Lemma 5. For a MEMDP N with environment graph G<sup>E</sup><sub>N</sub>, there is a path J<sub>1</sub>…J<sub>n</sub> in G<sup>E</sup><sub>N</sub> iff there is a path π = π<sub>1</sub>…π<sub>n</sub> in G<sub>N</sub> such that every π<sub>i</sub> is J<sub>i</sub>-local.

The shape of the environment graph is crucial for the algorithm we develop.

Lemma 6. Let G<sup>E</sup><sub>N</sub> = (V<sub>N</sub>, E<sub>N</sub>) be an environment graph for MEMDP N. First, E<sub>N</sub>(J, J′) implies J′ ⊊ J. Thus, G<sup>E</sup><sub>N</sub> is acyclic and has maximal path length |I|. The maximal outdegree of the graph is |S|<sup>2</sup>|A|.

The monotonicity regarding J, J′ follows from the definition of the belief update. The bound on the outdegree is a consequence of Lemma 9 below.

Local belief-support BOMDPs. Before we continue, we remark that the (future) dynamics in a BOMDP only depend on the current state and set of environments. More formally, we capture this intuition as follows.

Lemma 7. Let G<sub>N</sub> be a BOMDP with states S′. For any state ⟨s, j, J⟩ ∈ S′, let N′ = ReachFragment(N↓<sub>J</sub>, dirac(s)) and Y = {⟨s, i, J⟩ | i ∈ J}. Then:

ReachFragment(G<sub>N</sub>, unif(Y)) = G<sub>N′</sub>.

The key insight is that restricting the MEMDP does not change the transition functions for the environments j ∈ J. Furthermore, using monotonicity of the update, we only reach BOMDP-states whose behavior is determined by the environments in J.

This intuition allows us to analyze the BOMDP locally and lift the results to the complete BOMDP. We define a local BOMDP as the part of a BOMDP starting in any state in S<sub>J</sub>. All observations not in Z<sub>J</sub> are made absorbing.

Definition 10 (Local BOMDP). Given a MEMDP N with BOMDP G<sub>N</sub> and a set of environments J. The local BOMDP for environments J is the fragment

LocG(J) = ReachFragment(G<sub>N↓<sub>J</sub></sub>|<sub>F</sub>, unif(S<sub>J</sub>)), where F = Z \ Z<sub>J</sub>.

#### Algorithm 1 Search algorithm

    1: function Search(MEMDP N = ⟨S, A, {p_i}_{i∈I}, ι_init⟩, J ⊆ I, T ⊆ S)
    2:     T′ ← {⟨s, j, J⟩ | j ∈ J, s ∈ T}
    3:     for J′ s.t. E_N(J, J′) do                ▷ Consider the edges in the env. graph (Def. 9)
    4:         W_{J′} ← Search(N, J′, T)            ▷ Recursion!
    5:         T′ ← T′ ∪ {⟨s, j, J′⟩ | j ∈ J, ⟨s, J′⟩ ∈ W_{J′}}
    6:     return Win^{T′}_{LocG(J)} ∩ Z_J          ▷ Construct BOMDP as in Def. 10, then model check
    7:
    8: function ASWinning(MEMDP N = ⟨S, A, {p_i}_{i∈I}, ι_init⟩, T ⊆ S)
    9:     return O(Supp(ι_init)) ⊆ Search(N, I, T)

This definition of a local BOMDP coincides with a fragment of the complete BOMDP. We then mark exactly the winning observations restricted to the environment sets J′ ⊊ J as winning in the local BOMDP and compute all winning observations in the local BOMDP. These observations are winning in the complete BOMDP. The following concretization of Lemma 4 formalizes this.

Lemma 8. Consider a MEMDP N and a subset of environments J.

$$\mathsf{Win}^{T'}_{\mathsf{LocG}(J)} \cap Z_J \;=\; \mathsf{Win}^{T_{\mathcal{G}_{\mathcal{N}}}}_{\mathcal{G}_{\mathcal{N}}} \cap Z_J \quad \text{with} \quad T' = T_{\mathcal{G}_{\mathcal{N}}} \cup \left(\mathsf{Win}^{T_{\mathcal{G}_{\mathcal{N}}}}_{\mathcal{G}_{\mathcal{N}}} \setminus Z_J\right).$$

Furthermore, local BOMDPs are polynomially bounded in the size of the MEMDP.

Lemma 9. Let $\mathcal{N}$ be a MEMDP with states $S$ and actions $A$. $\mathsf{LocG}(J)$ has at most $\mathcal{O}(|S|^2 \cdot |A| \cdot |J|)$ states and $\mathcal{O}(|S|^2 \cdot |A| \cdot |J|^2)$ transitions<sup>3</sup>.

A PSPACE algorithm. We present Algorithm 1 for the MEMDP decision problem, which recurses depth-first over the paths in the environment graph<sup>4</sup> . We first state the correctness and the space complexity of this algorithm.

Lemma 10. ASWinning in Alg. 1 solves the decision problem in PSPACE.

To prove correctness, we first note that Search($\mathcal{N}$, $J$, $T$) computes $\mathsf{Win}^{T_{\mathcal{G}_{\mathcal{N}}}}_{\mathcal{G}_{\mathcal{N}}} \cap Z_J$. We show this by induction over the structure of the environment graph. For all $J$ without outgoing edges, the local BOMDP coincides with a BOMDP just for the environments $J$ (Lemma 7). Otherwise, observe that $T'$ in line 5 coincides with its definition in Lemma 8 and thus, by the same lemma, we return $\mathsf{Win}^{T_{\mathcal{G}_{\mathcal{N}}}}_{\mathcal{G}_{\mathcal{N}}} \cap Z_J$. To finalize the proof, a winning policy exists in the MEMDP if the observations of the initial states of the BOMDP are winning (Theorem 1). The algorithm must terminate as it recurses over all paths of a finite acyclic graph, see Lemma 6. Following Lemma 9, the number of frontier states is bounded by $|S|^2 \cdot |A|$. The main body of the algorithm therefore requires polynomial space, and the maximal recursion depth (stack height) is $|I|$ (Lemma 6). Together, this yields a space complexity in $\mathcal{O}(|S|^2 \cdot |A| \cdot |I|^2)$.

<sup>3</sup> The number of transitions is the number of nonzero entries in $p$.

<sup>4</sup> In contrast to depth-first-search, we do not memorize nodes we visited earlier.

Fig. 3: Constructed MEMDP for the QBF formula ∀x∃y. (x ∨ y) ∧ (¬x ∨ ¬y).

#### 4.2 Deciding Almost-Sure Winning for MEMDPs Is PSPACE-hard

It is not possible to improve the algorithm beyond PSPACE.

Lemma 11. The MEMDP decision problem is PSPACE-hard.

Hardness holds even for acyclic MEMDPs and uses the following fact.

Lemma 12. If a winning policy exists for an acyclic MEMDP, there also exists a winning policy that is deterministic.

In particular, almost-sure reachability coincides with avoiding the sink states; this is a safety property. For safety, deterministic policies are sufficient, as randomization only visits additional states, which is never beneficial for safety.

Regarding Lemma 11, we sketch a polynomial-time reduction from the PSPACE-complete TQBF problem [20] to the MEMDP decision problem. Let Ψ be a QBF formula, Ψ = ∃x₁∀y₁∃x₂∀y₂ … ∃xₙ∀yₙ. Φ, with Φ a Boolean formula in conjunctive normal form. The problem is to decide whether Ψ is true.

Example 3. Consider the QBF formula Ψ = ∀x∃y. (x ∨ y) ∧ (¬x ∨ ¬y). We construct a MEMDP with an environment for every clause, see Figure 3<sup>5</sup>. The state space consists of three states for each variable v ∈ V: the state v and the states v<sub>⊤</sub> and v<sub>⊥</sub> that encode its assignment. Additionally, we have a dedicated target state W and sink state F. We consider three actions: the actions true (⊤) and false (⊥) semantically describe the assignment to existentially quantified variables; the action any (α) is used in all other states. Every environment reaches the target state iff one literal in the clause is assigned true.

In the example, intuitively, a policy should assign the negation of x to y. Formally, the policy σ, characterized by σ(π · y) = ⊤ iff x<sub>⊥</sub> ∈ π, is winning.
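The truth of Ψ in Example 3 can be checked by brute-force quantifier expansion; the small sketch below (illustrative only, not part of the reduction) also confirms that the witnessing choice y = ¬x mirrors the winning policy σ.

```python
def phi(x, y):
    # Phi = (x or y) and (not x or not y), the matrix of Psi
    return (x or y) and ((not x) or (not y))

# Psi = forall x exists y . Phi: for every x some y must satisfy Phi
psi = all(any(phi(x, y) for y in (False, True)) for x in (False, True))

# The witnessing assignment mirrors the winning policy: y = not x
assert all(phi(x, not x) for x in (False, True))
```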

As a consequence of this construction, we may also deduce the following theorem.

#### Theorem 3. Deciding whether a memoryless winning policy exists is NP-complete.

The proof of NP-hardness uses a similar construction for the propositional SAT fragment of QBF, i.e., without universal quantifiers. Additionally, the problem for memoryless policies is in NP, because one can nondeterministically guess a (polynomially sized) memoryless policy and verify it in each environment independently.
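The NP membership argument can be made concrete: fixing a memoryless policy turns each environment into a finite Markov chain, where the target is reached almost surely iff every state reachable from the initial state can still reach the target. The sketch below illustrates this check under the assumption that environments are given as transition-support maps; all names are hypothetical, not the paper's notation.

```python
def almost_sure_reach(succ, init, targets):
    """Almost-sure reachability in a finite Markov chain: the targets are
    reached with probability 1 iff every state reachable from `init` can
    still reach some target. Exact probabilities are irrelevant, only the
    support matters, so `succ` maps a state to its possible successors
    (absorbing states may simply be absent from the map)."""
    forward, stack = {init}, [init]                 # reachable from init
    while stack:
        for t in succ.get(stack.pop(), ()):
            if t not in forward:
                forward.add(t)
                stack.append(t)
    backward, stack = set(targets), list(targets)   # can reach a target
    while stack:
        s = stack.pop()
        for u, vs in succ.items():
            if s in vs and u not in backward:
                backward.add(u)
                stack.append(u)
    return forward <= backward

def verify_memoryless(policy, environments, init, targets):
    """NP witness check: verify the guessed policy in every environment."""
    def induced(env):
        # Markov chain obtained by fixing the policy's action in each state
        return {s: set(acts[policy[s]]) for s, acts in env.items()}
    return all(almost_sure_reach(induced(env), init, targets)
               for env in environments)
```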

<sup>5</sup> We depict a slightly simplified MEMDP for conciseness.

Fig. 4: Witness for exponential memory requirement for winning policies.

### 4.3 Policy Problem

Policies, mapping histories to actions, are generally infinite objects. However, we may extract winning policies from the BOMDP, which is (only) exponential in the MEMDP. Finite state controllers [34] are a suitable and widespread representation of policies that require only a finite amount of memory. Intuitively, the number of memory states reflects the number of equivalence classes of histories that a policy can distinguish. In general, we cannot hope to find smaller policies than those obtained via a BOMDP.

Theorem 4. There is a family of MEMDPs $\{\mathcal{N}^n\}_{n \ge 1}$ where, for each $n$, $\mathcal{N}^n$ has $2n$ environments and $\mathcal{O}(n)$ states, and where every winning policy for $\mathcal{N}^n$ requires at least $2^n$ memory states.

We illustrate the witness. Consider a family of MEMDPs $\{\mathcal{N}^n\}_n$, where $\mathcal{N}^n$ has $2n$ MDPs, $4n$ states partitioned into two parts, and at most $2n$ outgoing actions per state. We outline the MEMDP family in Figure 4. In the first part, there is only one action per state. The notation is as follows: in state $s_0$ and MDP $\mathcal{N}^n_1$, we transition with probability one to state $a_0$, whereas in $\mathcal{N}^n_2$ we transition with probability one to state $b_0$. In every other MDP, we transition with probability one half to either state. In state $s_1$, we do the analogous construction for environments 3 and 4, and so on. A path $s_0 b_0 \dots$ is thus consistent with every MDP except $\mathcal{N}^n_1$. The first part ends in state $s_n$. By construction, there are $2^n$ paths ending in $s_n$. Each of them is (in)consistent with a unique set of $n$ environments. In the second part, a policy may guess $n$ times an environment by selecting an action $\alpha_i$ with $i \in [2n]$. Only in MDP $\mathcal{N}^n_i$ does action $\alpha_i$ lead to a target state. In all other MDPs, the transition leads from state $g_j$ to $g_{j+1}$. The state $g_{n+1}$ is absorbing in all MDPs. Importantly, after taking an action $\alpha_i$ and arriving in $g_{j+1}$, there is (at most) one more MDP inconsistent with the path.

Every MEMDP $\mathcal{N}^n$ in this family has a winning policy which takes $\sigma(\pi \cdot g_i) = \alpha_{2i-1}$ if $a_i \in \pi$ and $\sigma(\pi \cdot g_i) = \alpha_{2i}$ otherwise. Furthermore, when arriving in state $s_n$, the state of a finite memory controller must reflect the precise set of environments consistent with the history. There are $2^n$ such sets. The proof shows that if we store less information, two paths will lead to the same memory state, but with different sets of environments being consistent with these paths. As we can rule out only $n$ environments using the $n$ actions in the second part of the MEMDP, we cannot ensure winning in every environment.

## 5 A Partial Game Exploration Algorithm

In this section, we present an algorithm for the policy problem. We tune the algorithm towards runtime instead of memory complexity, but aim to avoid running out of memory. We use several key ingredients to create a pragmatic variation of Alg. 1, with support for extracting the winning policy.

First, we use an abstraction from BOMDPs to a belief stochastic game (BSG) similar to [45] that reduces the number of states and simplifies the iterative construction<sup>6</sup>. Second, we tailor and generalize ideas from bounded model checking [6] to build and model check only a fragment of the BSG, using explicit partial exploration approaches as in, e.g., [33,9,42,29]. Third, our exploration does not continuously extend the fragment, but can also prune this fragment by using the model checking results obtained so far. The structure of the BSG as captured by the environment graph makes the approach promising and yields some natural heuristics. Fourth, the structure of the winning region allows us to generalize results to unseen states. We thereby operationalize an idea from [26] in a partial exploration context. Finally, we analyze individual MDPs as an efficient and significant preprocessing step. In the following, we discuss these ingredients.

Abstraction to Belief Support Games. We briefly recap stochastic games (SGs). See [38,17] for more details.

Definition 11 (SG). A stochastic game is a tuple $\mathcal{B} = \langle M, S_1, S_2 \rangle$, where $M = \langle S, A, \iota_{\mathsf{init}}, p \rangle$ is an MDP and $(S_1, S_2)$ is a partition of $S$.

$S_1$ are Player 1 states, and $S_2$ are Player 2 states. As common, we also 'partition' (memoryless deterministic) policies into two functions $\sigma_1 \colon S_1 \to A$ and $\sigma_2 \colon S_2 \to A$. A Player 1 policy $\sigma_1$ is winning for state $s$ if $\Pr(T \mid \sigma_1, \sigma_2) = 1$ for all $\sigma_2$. We (re)use $\mathsf{Win}^T_{\mathcal{B}_{\mathcal{N}}}$ to denote the set of states with a winning policy.

We apply a game-based abstraction to group states that have the same observation. Player 1 states capture the observations in the BOMDP, i.e., tuples $\langle s, J \rangle$ of MEMDP states $s$ and subsets $J$ of the environments. Player 1 selects an action $a$; the result is the Player 2 state $\langle \langle s, J \rangle, a \rangle$. Then Player 2 chooses an environment $j \in J$, and the game mimics the outgoing transition from $\langle s, j, J \rangle$, i.e., it mimics the transition from $s$ in $\mathcal{N}_j$. Formally:

Definition 12 (BSG). Let $\mathcal{G}_{\mathcal{N}}$ be a BOMDP with $\mathcal{G}_{\mathcal{N}} = \langle \langle S, A, \iota_{\mathsf{init}}, p \rangle, Z, O \rangle$. The belief support game $\mathcal{B}_{\mathcal{N}}$ for $\mathcal{G}_{\mathcal{N}}$ is an SG $\mathcal{B}_{\mathcal{N}} = \langle \langle S', A', \iota'_{\mathsf{init}}, p' \rangle, S_1, S_2 \rangle$ with $S' = S_1 \cup S_2$ as usual, Player 1 states $S_1 = Z$, Player 2 states $S_2 = Z \times A$, actions $A' = A \cup I$, initial distribution $\iota'_{\mathsf{init}}(\langle s, I \rangle) = \sum_{i \in I} \iota_{\mathsf{init}}(\langle s, i, I \rangle)$, and the (partial) transition function $p'$ defined separately for Player 1 and Player 2:

$$p'(z,a) = \mathsf{dirac}(\langle z,a \rangle) \tag{Player 1}$$

$$p'(\langle z, a \rangle, j, z') = p(\langle s, j, J \rangle, a, \langle s', j, J' \rangle) \text{ with } z = \langle s, J \rangle, z' = \langle s', J' \rangle \quad \text{(Player 2)}$$
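The Player 2 step implicitly performs the belief-support update of the BOMDP: after observing successor $s'$ under action $a$, exactly those environments survive whose transition function could have produced $s'$. A small sketch of this update, with transition supports given as a hypothetical `support` map (not the paper's notation):

```python
def update_support(J, s, a, s_next, support):
    """Belief-support update: keep exactly the environments j in J for
    which s_next is a possible successor of s under action a, i.e. those
    with p_j(s, a, s_next) > 0.

    support[j] maps (state, action) to the set of possible successors."""
    return frozenset(j for j in J if s_next in support[j].get((s, a), ()))
```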

<sup>6</sup> At the time of writing, we were unaware of a polytime algorithm for BOMDPs.


Algorithm 2 Policy finding algorithm

Lemma 13. An (acyclic) MEMDP $\mathcal{N}$ with target states $T$ is winning if(f) there exists a winning policy in the BSG $\mathcal{B}_{\mathcal{N}}$ with target states $T_Z$.

Thus, on acyclic MEMDPs, a BSG-based algorithm is sound and complete; on cyclic MEMDPs, however, it may fail to find a winning policy. The remainder of the algorithm is formulated on the BSG; we use sliced BSGs, i.e., the BSG of a sliced BOMDP or, equivalently, a BSG with some states made absorbing.

Main algorithm. We outline Algorithm 2 for the policy problem. We track the sets of almost-surely winning observations and losing observations (states in the BSG). Initially, the target states are winning. Furthermore, via simple preprocessing, we determine some winning and losing states on the individual MDPs.

We iterate until the initial state is winning or losing. Our algorithm constructs a sliced BSG and decides on the fly whether a state should be a frontier state, returning the sliced BSG and the used frontier states. We discuss the implementation below. For the sliced BSG, we compute the winning region twice: once assuming that the frontier states are winning, and once assuming they are losing. This yields an approximation of the winning and losing states, see Lemma 4. From the winning states, we can extract a randomized winning policy [13].

Soundness. Assume that $\mathcal{B}_{\mathcal{N}}$ is indeed a sliced BSG with frontier $F$. Then the following invariant holds: $W \subseteq \mathsf{Win}^T_{\mathcal{B}_{\mathcal{N}}}$ and $L \cap \mathsf{Win}^T_{\mathcal{B}_{\mathcal{N}}} = \emptyset$. This invariant exploits that from a sliced BSG we can (implicitly) slice the complete BSG while preserving the winning status of every state, formalized below. In future iterations we only explore the implicitly sliced BSG.

Lemma 14. Given $W \subseteq \mathsf{Win}^{T_{\mathcal{B}_{\mathcal{N}}}}_{\mathcal{B}_{\mathcal{N}}}$ and $L \subseteq S \setminus \mathsf{Win}^{T_{\mathcal{B}_{\mathcal{N}}}}_{\mathcal{B}_{\mathcal{N}}}$:

$$\mathsf{Win}^{T_{\mathcal{B}_{\mathcal{N}}}}_{\mathcal{B}_{\mathcal{N}}} = \mathsf{Win}^{T_{\mathcal{B}_{\mathcal{N}}} \cup W}_{\mathcal{B}_{\mathcal{N}}|_{W \cup L}}$$

Termination depends on the sliced game generation. It suffices to ensure that, in the long run, either W or L grows, as there are only finitely many states. If W and L remain unchanged for more than some number of iterations, W ∪ L is used as the frontier. Then, the new game suffices to determine whether s ∈ W in one shot.

Generating the sliced BSG. Algorithm 3 outlines the generation of the sliced BSG. In particular, we explore the implicit BSG from the initial state but make every state that we do not explicitly explore absorbing. In every iteration, we first check whether there are states in Q left to explore and whether the number of explored states in E is below a threshold Bound[i]. Then, we take a state from the priority queue and add it to E. We find newly reachable states<sup>7</sup> and add them to the queue Q.

Generalizing the winning and losing states. We aim to determine that a state in the game B<sup>N</sup> is winning without ever exploring it. First, observe:

Lemma 15. A winning policy in MEMDP $\mathcal{N}$ is winning in $\mathcal{N}^{\downarrow J}$ for any $J$.

A direct consequence is the following statement for two environment sets $J_1 \subseteq J_2$:

$$
\langle s, J\_2 \rangle \in \mathsf{Win}\_{\mathcal{B}\_{\mathcal{N}}}^T \quad \text{implies} \quad \langle s, J\_1 \rangle \in \mathsf{Win}\_{\mathcal{B}\_{\mathcal{N}}}^T.
$$

Consequently, we can store W (and, symmetrically, L) as follows. For every MEMDP state $s \in S$, the set $W_s = \{J \mid \langle s, J \rangle \in W\}$ is downward closed w.r.t. the partial order $(2^I, \subseteq)$. This allows for efficient storage: we only have to store the set of pairwise maximal elements, i.e., the antichain,

$$W\_s^{\max} = \{ J \in W\_s \mid \forall J' \in W\_s:\; J \not\subsetneq J' \}.$$

To determine whether $\langle s, J \rangle$ is winning, we check whether $J \subseteq J'$ for some $J' \in W_s^{\max}$. Adding $J$ to $W_s^{\max}$ requires removing all $J' \subseteq J$ and then adding $J$. Note, however, that $|W_s^{\max}|$ is still exponential in $|I|$ in the worst case.
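The antichain representation can be sketched in a few lines: `contains` implements the membership test J ⊆ J′ and `add` maintains only pairwise-incomparable maximal sets. Class and method names are illustrative, not taken from the implementation.

```python
class DownwardClosedSet:
    """Store a downward-closed family of sets by its maximal antichain."""

    def __init__(self):
        self.maximal = []  # pairwise incomparable frozensets

    def contains(self, J):
        # <s, J> is winning iff J lies below some stored maximal element
        return any(J <= Jp for Jp in self.maximal)

    def add(self, J):
        J = frozenset(J)
        if self.contains(J):
            return  # already implied, antichain unchanged
        # remove all elements subsumed by J, then insert J
        self.maximal = [Jp for Jp in self.maximal if not Jp <= J] + [J]
```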

Selection of heuristics. The algorithm leaves some degrees of freedom, and we evaluate the following aspects empirically. (1) The maximal size bound[i] of a sliced BSG at iteration i is critical. If it is too small, the sets W and L grow only slowly in each iteration. The trade-off is further complicated by the fact that the sets W and L may generalize to unseen states. (2) For a fixed bound[i], it is unclear how to prioritize the exploration of states. The PSPACE algorithm suggests that going deep is good, whereas the potential for generalization to unseen states is largest when going broad. (3) Finally, there is overhead in computing both W and L. If there is a winning policy, we only need to compute W; however, computing L may allow us to prune parts of the state space. A similar observation holds for computing W on unsatisfiable instances.

Remark 1. Algorithm 2 can be mildly tweaked to meet the PSPACE algorithm in Algorithm 1. The priority queue must ensure to always include complete

<sup>7</sup> In l. 5 we do not rebuild the game B from scratch but incrementally construct the data structures. Likewise, reachable states are a direct byproduct of this construction.

Fig. 5: Performance of baselines and novel PaGE algorithm

(reachable) local BSGs and to explore states $\langle s, J \rangle$ with small $J$ first. Furthermore, W and L require regular pruning, and we cannot extract a policy if we prune W to a polynomial size bound. Practically, we may write pruned parts of W to disk.

## 6 Experiments

We highlight two aspects: (1) A comparison of our prototype to existing baselines for POMDPs, and (2) an examination of the exploration heuristics. The technical report [41] contains details on the implementation, the benchmarks, and more results.

Implementation. We provide a novel PArtial Game Exploration (PaGE) prototype, based on Algorithm 2, on top of the probabilistic model checker Storm [22]. We represent MEMDPs using the PRISM language with integer constants. Every assignment to these constants induces an explicit MDP. SGs are constructed and solved using existing data structures and graph algorithms.

Setup. We create a set of benchmarks inspired by the POMDP and MEMDP literature [26,12,21]. We consider a combination of satisfiable and unsatisfiable benchmarks. In the latter case, a winning policy does not exist. We construct POMDPs from MEMDPs as in Definition 5. As baselines, we use the following two existing POMDP algorithms. For almost-sure properties, a belief-MDP construction [7] acts similarly to an efficiently engineered variant of our game construction, but is tailored towards more general quantitative properties. A SAT-based approach [26] aims to find increasingly larger policies. We evaluate all benchmarks on a system with a 3GHz Intel Core i9-10980XE processor. We use a time limit of 30 minutes and a memory limit of 32 GB.

Results. Figure 5 shows the (log-scale) performance comparisons between different configurations<sup>8</sup>. Green circles reflect satisfiable and red crosses unsatisfiable benchmarks. The x-axis shows PaGE in its default configuration. The first plot compares against the belief-MDP construction. The tailored heuristics and representation of the belief support give a significant edge in almost all cases. The few points

<sup>8</sup> Every point $\langle x, y \rangle$ in the graph reflects a benchmark which was solved by the configuration on the x-axis in x time and by the configuration on the y-axis in y time. Points above the diagonal are thus faster for the configuration on the x-axis.


Table 1: Satisfiable and unsatisfiable benchmark results


below the line are due to a higher exploration rate when building the state space. The second plot compares to the SAT-based approach, which is only suitable for finding policies, not for disproving their existence. This approach implicitly searches for a particular class of policies, whose structure is not appropriate for some MEMDPs. The third plot compares PaGE in the default configuration – with negative entropy as priority function – with PaGE using positive entropy. As expected, different priorities have a significant impact on the performance.

Table 1 shows an overview of satisfiable and unsatisfiable benchmarks. Each table shows the number of environments, states, and actions-per-state in the MEMDP. For PaGE, we include both the default configuration (negative entropy) and variation (positive entropy). For both configurations, we provide columns with the time and the maximum size of the BSG constructed. We also include the time for the two baselines. Unsurprisingly, the number of states to be explored is a good predictor for the performance and the relative performance is as in Fig. 5.

## 7 Conclusion

This paper considers multi-environment MDPs with an arbitrary number of environments and an almost-sure reachability objective. We show novel and tight complexity bounds and use these insights to derive a new algorithm. This algorithm outperforms approaches for POMDPs on a broad set of benchmarks. For future work, we will apply an algorithm directly on the BOMDP [16].

## Data-Availability Statement

Supplementary material related to this paper is openly available on Zenodo at: https://doi.org/10.5281/zenodo.7560675

## References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## Mungojerrie: Linear-Time Objectives in Model-Free Reinforcement Learning<sup>⋆</sup>

Ernst Moritz Hahn<sup>1</sup>, Mateo Perez<sup>2</sup>, Sven Schewe<sup>3</sup>, Fabio Somenzi<sup>2</sup>(✉), Ashutosh Trivedi<sup>2</sup>, and Dominik Wojtczak<sup>3</sup>

> <sup>1</sup> University of Twente, Enschede, The Netherlands
> <sup>2</sup> University of Colorado Boulder, Boulder, USA (fabio@colorado.edu)
> <sup>3</sup> University of Liverpool, Liverpool, UK

Abstract. Mungojerrie is an extensible tool that provides a framework to translate linear-time objectives into reward for reinforcement learning (RL). The tool provides convergent RL algorithms for stochastic games, reference implementations of existing reward translations for ω-regular objectives, and an internal probabilistic model checker for ω-regular objectives. This functionality is modular and operates on shared data structures, which enables fast development of new translation techniques. Mungojerrie supports finite models specified in PRISM and ω-automata specified in the HOA format, with an integrated command line interface to external linear temporal logic translators. Mungojerrie is distributed with a set of benchmarks for ω-regular objectives in RL.

## 1 Introduction

Reinforcement learning (RL) [41] is a sequential optimization approach where a decision maker learns to optimally resolve a sequence of choices based on feedback received from the environment. This feedback often takes the form of rewards and punishments proportional to the fitness of the decisions taken by the agent (or their effects) as judged by the environment towards some higher-level objectives. We call such objectives learning objectives. RL is inspired by the way dopamine-driven organisms latch on to past rewarding actions and hence, historically, RL adopted a myopic way of looking at the reward sequences in the form of the discounted sum of rewards, where the discount factor controls the weight placed toward future rewards. More recently, other forms of reward aggregation, such as limit-average, have also been considered. A key design challenge for users of RL is that of translation: given a class of learning objectives and aggregator functions, design a reward function from the sequence of learner's choices to scalar rewards such that an RL agent maximizing the aggregated sum of rewards converges to an optimal policy for the learning objective.

© The Author(s) 2023

<sup>⋆</sup> Mungojerrie is available at plv.colorado.edu/mungojerrie. This work is supported in part by the National Science Foundation (NSF) grant CCF-2009022 and by NSF CA-REER award CCF-2146563. This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreements No 864075 (CAESAR) and 956123 (FOCETA).

S. Sankaranarayanan and N. Sharygina (Eds.): TACAS 2023, LNCS 13993, pp. 527–545, 2023. https://doi.org/10.1007/978-3-031-30823-9_27

Fig. 1. The reinforcement learning loop implemented within Mungojerrie. The interpreter assigns reward to the agent based on the state of the model and automaton.

The translation of objectives to reward signals has historically been a largely manual process. Such translations not only depend on the expertise of the translator in reward engineering, but also pose obstacles to providing formal guarantees on the faithfulness of the translation. Unsurprisingly, specifying reward manually is prone to error [22,44]. As the practice of model-free RL continues to produce impressive results [38,31,29], the integration of RL in safety-critical system design is inevitable. An alternative to manually programming the reward function is to specify the objective in a formal language and have it "compiled" to a reward function. We call such a translation a reward scheme.

In designing reward schemes for RL, one strives to achieve an overall translation that is faithful (maximizing reward means maximizing the probability of achieving the objective) and effective (RL quickly converges to optimal strategies). While the faithfulness of a reward scheme can be established theoretically, its effectiveness requires experimental evaluation. Experimenting with reward schemes requires a framework for specifying learning objectives, environments, a wide range of RL algorithms, and an interface for connecting reward schemes with these components. In addition, it may be beneficial to have access to a probabilistic model checker to evaluate the quality of the policy computed by RL, and to compare it against ground truth.

Mungojerrie is designed to provide this functionality for learning requirements expressible as linear-time objectives (ω-regular languages [32] and linear temporal logic [27,33]) against finite MDPs and stochastic games.

Features. Mungojerrie is designed with ease of use and extensibility in mind. Models in Mungojerrie can be specified in PRISM [25], which maintains compatibility with existing benchmarks, or by explicitly constructing the model via calls to internal functions. Mungojerrie supports reading ω-automata in the Hanoi Omega-Automata (HOA) format [2], and has a command line interface connecting Mungojerrie with performant LTL translators (Spot [7] and Owl [24]). Mungojerrie provides an OpenAI Gym [4]-like interface between the RL algorithms (included with the tool) and the learning environment to allow integration with off-the-shelf RL algorithms. The tool also has methods for performing probabilistic model checking (including end-component decomposition, stochastic shortest-path, and discounted-reward optimization) of ω-regular objectives on the same data structures used for learning. Mungojerrie also provides reference implementations of several reward schemes [11,12,14,19,23] proposed by the formal methods community. Mungojerrie is packaged with over 100 benchmarks and outputs GraphViz [8] for easy visualization of small models and automata.

An introductory example. Figure 2 shows an example MDP in which a gambler places bets with the aim of accumulating a wealth of 7 units. In addition, the gambler will quit if her wealth wanes to just one unit more than once. This objective is captured by the (deterministic) Büchi automaton of Fig. 3. Mungojerrie computes a strategy for the gambler that maximizes the probability of satisfying her objective. Figure 4 shows the Markov chain that results from following this strategy. This figure was minimally modified from GraphViz output from Mungojerrie. Note that the strategy altogether avoids the state in which x = 1; hence it achieves the same probability of success (5/7) as an optimal strategy for the simpler objective of eventually reaching x = 7 (without going broke). Mungojerrie computes the strategy of Fig. 4 by RL; it can also verify it by probabilistic model checking.
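The success probability 5/7 for the simpler reachability objective can be reproduced independently of Mungojerrie with a few lines of value iteration over the gambler MDP of Fig. 2. This is a minimal sketch with an illustrative helper name, not the tool's implementation.

```python
def max_reach_probability(n=7, start=5, p=0.5, sweeps=5000):
    """Value iteration for the maximal probability of reaching wealth n
    (the "rich" state x = n) before going broke in the gambler MDP."""
    v = [0.0] * (n + 1)
    v[n] = 1.0                                    # target: x = n
    for _ in range(sweeps):
        for x in range(1, n):
            # bets b1..b3 of Fig. 2: bet b is enabled when b <= x <= n - b
            bets = [b for b in (1, 2, 3) if b <= x <= n - b]
            v[x] = max(p * v[x + b] + (1 - p) * v[x - b] for b in bets)
    return v[start]
```

For a fair coin (p = 1/2), wealth is a martingale, so every strategy yields probability x/7 of reaching 7 from wealth x; value iteration converges to 5/7 from the initial wealth 5, matching the probability reported above.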

## 2 Overview of Mungojerrie

Models. The systems used in Mungojerrie consist of finite sets of states and actions, where states are labeled with atomic propositions. There are at most two strategic players: the Max player and the Min player. Each state is controlled by one player. We call models where all states are controlled by the Max player Markov decision processes (MDPs) [34]; otherwise, we refer to them as stochastic games [5].

Mungojerrie supports parsing models specified in the PRISM language. The allowed model types are "mdp" (Markov decision process) and "smg" (stochastic multiplayer game) with two players. There should be one initial state. The interface for building the model is exposed, allowing extensions of Mungojerrie to connect with parsers for other languages. The authors of [6] used Mungojerrie in their experiments by extending the tool to support continuous-time MDPs.

Properties. The properties natively supported by Mungojerrie are ω-regular languages. Starting from the initial state, the players produce an infinite sequence of states with a corresponding infinite sequence of atomic propositions: an ω-word. The inclusion of this ω-word in our ω-regular language determines whether or not this particular run satisfies the property. The Max player maximizes the probability that a run is satisfying, while the goal of the Min player is the opposite.

We specify our ω-regular language as an ω-automaton, which may be nondeterministic. For model checking and RL, this nondeterminism must be resolved on the fly. Automata where this can be done in any MDP without changing acceptance are said to be Good-for-MDPs (GFM) [13]. Automata where this can be done in any stochastic game without changing acceptance are said to be Good-for-Games (GFG) [21]. In general, nondeterministic Büchi automata are not GFM, but two classes of GFM Büchi automata with limited nondeterminism have been studied: suitable limit-deterministic Büchi automata [10,37] and slim Büchi automata [13].

The user of Mungojerrie can either provide the ω-automaton directly or use one of the supported external translators to generate the automaton from LTL with a single call to Mungojerrie. Mungojerrie reads automata specified in the HOA format. Mungojerrie supports providing the ω-automaton directly for testing the effectiveness of different automata for learning (see Section 4). The LTL translators that can be called from Mungojerrie are the ePMC plugin from [13], Spot [7], and Owl [24] for generating slim Büchi, deterministic parity, and suitable limit-deterministic Büchi automata. The user is responsible for ensuring that directly provided ω-automata have the appropriate property, GFM or GFG.

For use in Mungojerrie, the labels and acceptance conditions for the automaton should be on the transitions. The acceptance conditions supported by

```
 0 mdp
 1
 2 const int Wealth = 5;     // initial gambler's wealth
 3 const double p = 1/2;     // probability of winning one bet
 4
 5 label "rich" = x = 7;
 6 label "poor" = x = 1;
 7
 8 module gambler
 9   x : [0..7] init Wealth;
10
11   [b0] x=0 ∨ x=7 → true;  // absorbing states
12   [b1] x>0 ∧ x<7 → p : (x'=x+1) + (1−p) : (x'=x−1);
13   [b2] x>1 ∧ x<6 → p : (x'=x+2) + (1−p) : (x'=x−2);
14   [b3] x>2 ∧ x<5 → p : (x'=x+3) + (1−p) : (x'=x−3);
15 endmodule
```
Fig. 2. A Gambler's Ruin model in the PRISM language. Line 13, for example, says that when 1 < x < 6, the gambler may bet two units because action b2 is enabled. The '+' sign does double duty: as addition symbol in arithmetic expressions and as separator of probabilistic transitions.

Fig. 3. Deterministic Büchi automaton equivalent to the LTL formula ¬poor U rich ∨ (poor ∧ X(¬poor U rich)). The transitions marked with the green dots are accepting.

Fig. 4. Optimal gambler strategy for the objective of Fig. 3. Boxes are decision states and circles are probabilistic choice states. For a decision state, the label gives the value of x and the state of the automaton. Transitions are labelled with either an action or a probability, and with the priority (1 for accepting and 0 for non-accepting).

Mungojerrie should be reducible to parity acceptance conditions without altering the transition structure of the automaton. This includes parity, Büchi, co-Büchi, Streett 1 (one pair), and Rabin 1 (one pair) conditions. Nondeterministic automata must have Büchi acceptance conditions. Generalized acceptance conditions are not supported in version 1.1.

Reinforcement Learning. The RL algorithms optimize over MDP/stochastic game environments equipped with a Markovian reward function. The reward function assigns a reward $R_{t+1} \in \mathbb{R}$ that depends on the state and action at timestep t and on the next state at timestep t + 1. As the players make their choices within the environment, the resulting play produces a sequence of states, actions, and rewards $(S_0, A_0, R_1, S_1, A_1, R_2, \ldots)$. The discounted reward aggregator is

$$\text{disc}\_{\gamma}(\pi,\nu) = \mathbb{E}\_{\pi,\nu} \left[ \sum\_{t \ge 0} \gamma^t R\_{t+1} \right],$$

where π is the strategy of the Max player, ν is the strategy of the Min player, γ ∈ [0, 1) is the discount factor, and $R_t$ is the reward at timestep t. We can set γ = 1 when, with probability 1, we enter an absorbing sink (termination) where we receive no further reward. This is called the episodic setting. Another well-studied RL aggregator is the limit-average reward, defined as

$$\text{avg}(\pi,\nu) = \limsup\_{n \to \infty} \frac{1}{n} \mathbb{E}\_{\pi,\nu} \left[ \sum\_{n \ge t \ge 0} R\_{t+1} \right].$$

The limit-average reward aggregator is natural in the continuing setting, where the agent's trajectory is never reset and there is no preferred initial state [30]. The objective of RL is to compute the optimal value and policies for a given aggregator. Mungojerrie includes the stochastic game extensions of Q-learning [43], Double Q-learning [20], and Sarsa(λ) [40] for RL in finite state and action models. Mungojerrie also includes Differential Q-learning [42] for average-reward RL in finite communicating MDPs. We collectively refer to the parameters that are set by hand prior to running an RL algorithm as hyperparameters. Mungojerrie supports changing all hyperparameters from the command line. As the design of Mungojerrie separates the learning agent(s) from the reward scheme, extending Mungojerrie to include another RL algorithm is easy.
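As a concrete illustration of the tabular setting, the sketch below runs a plain Q-learning update loop on a toy episodic MDP. The MDP, hyperparameter values, and all names are our own illustrative choices, not Mungojerrie's API.

```python
import random

def q_learning(transitions, rewards, episodes=3000, gamma=0.9,
               alpha=0.1, epsilon=0.1, seed=0):
    """Tabular Q-learning on a toy MDP.
    transitions[s][a]: list of (probability, successor); successor None
    is an absorbing, zero-reward sink.  rewards[s][a]: immediate reward."""
    rng = random.Random(seed)
    states = list(transitions)
    q = {s: {a: 0.0 for a in transitions[s]} for s in states}

    def sample(dist):
        r, acc = rng.random(), 0.0
        for prob, nxt in dist:
            acc += prob
            if r < acc:
                return nxt
        return dist[-1][1]

    for _ in range(episodes):
        s = states[0]                          # fixed initial state
        while s is not None:
            acts = list(q[s])
            a = (rng.choice(acts) if rng.random() < epsilon
                 else max(acts, key=q[s].get))  # epsilon-greedy behaviour
            s2 = sample(transitions[s][a])
            best_next = max(q[s2].values()) if s2 is not None else 0.0
            q[s][a] += alpha * (rewards[s][a] + gamma * best_next - q[s][a])
            s = s2
    return q

# 'safe' pays 0.3 and stops; 'risky' pays 1.0 with probability 0.8 (via
# state 1), else nothing (via state 2).  The discounted value of 'risky'
# is 0.9 * 0.8 = 0.72 > 0.3, so Q-learning should come to prefer it.
transitions = {0: {'safe': [(1.0, None)], 'risky': [(0.8, 1), (0.2, 2)]},
               1: {'stop': [(1.0, None)]}, 2: {'stop': [(1.0, None)]}}
rewards = {0: {'safe': 0.3, 'risky': 0.0}, 1: {'stop': 1.0}, 2: {'stop': 0.0}}
q = q_learning(transitions, rewards)
```

With the epsilon-greedy behaviour policy, the Q-value of 'risky' approaches its discounted expected return and overtakes that of 'safe'.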

Reward Schemes. The user of Mungojerrie can either select one of the reward schemes included with the tool or extend the tool to include a new reward scheme. Mungojerrie also allows the use of the reward specified in the PRISM model (either state- or action-based). The following reward schemes are included in version 1.1 of Mungojerrie:

– Limit-reachability. The limit-reachability scheme [11] uses a GFM Büchi automaton. This reward scheme converts each accepting edge of the automaton into a transition that, with probability 1 − ζ, leads to a sink and yields a reward of +1, where 0 < ζ < 1 is a hyperparameter. All other transitions produce zero reward. For sufficiently large ζ and discount factor γ, strategies that are optimal for the discounted reward maximize the probability of satisfaction of the Büchi objective.

– Multi-discounted. The multi-discounted reward scheme [3] also uses a GFM Büchi automaton. This translation converts each accepting edge of the automaton into a transition that yields a reward of 1 − γ<sub>B</sub> and is discounted by γ<sub>B</sub>, where 0 < γ<sub>B</sub> < 1 is a hyperparameter. All other transitions yield no reward and are discounted by the standard discount factor γ. For suitably large γ<sub>B</sub> and γ, discounted-reward-optimal strategies maximize the probability of satisfaction of the Büchi objective.

– Dense limit-reachability. The dense limit-reachability reward scheme [12] connects the approaches of [11] and [3]. It is identical to [11] except that it gives a +1 reward every time an accepting transition is seen, instead of only when the transition to the sink succeeds. Since discounting can be thought of as a constant stopping probability [41], this reward scheme is the same in expectation as a scaled version of [3].

– Parity. The parity reward scheme was proposed for stochastic games in [14]. For two-player games, it requires a GFG automaton. This translation utilizes a deterministic parity automaton with a max-odd objective. Transitions of priority i go to a sink with probability ε<sup>k−i</sup>, where k is the number of priorities and 0 < ε < 1 is a hyperparameter. The transition to the sink receives a +1 or −1 reward for odd or even priorities, respectively. All other transitions receive zero reward. For sufficiently small ε, maximizing the cumulative reward results in a strategy maximizing the probability of satisfaction of the parity objective.

– Priority tracker. The priority tracker reward scheme was proposed by Hahn et al. [14]. For MDPs, Hahn et al. introduce a priority tracker gadget that takes a parity objective with a hyperparameter 0 < ε < 1. The priority tracker consists of two stages. In the first stage, we wait for transients to end, ending the stage with probability ε on each step. In the second stage, we detect the maximum priority occurring infinitely often with a set of wait states, where we accept the current maximum with probability ε on each step. For sufficiently small ε and sufficiently large discount γ, maximizing the discounted reward also maximizes the probability of satisfaction of the parity objective.

– Lexicographic. Hahn et al. [19] proposed this reward scheme for lexicographic ω-regular objectives. In this reward scheme, a tracker gadget keeps track of which accepting edges of the GFM Büchi automata have been seen. When the tracker indicates that at least one accepting edge has been seen, the learning agent can decide to "cash in" the tracker, which clears it. When this happens, with probability 1 − ζ the learning agent receives a reward that is the weighted sum of the seen accepting edges, scaled by powers of f, and transitions to a terminating sink, where 0 < ζ < 1 and f ≥ 1 are hyperparameters. For suitable f, ζ, and γ, maximizing the discounted reward yields the lexicographically optimal strategy.

– Average. The average reward scheme [23] translates absolute-liveness ω-regular objectives, which are concerned only with eventual satisfaction, to average reward for communicating MDPs. Given a GFM Büchi automaton, transitions from every state of the automaton back to its initial state, so-called "resets", are introduced. A hyperparameter c < 0 gives a penalizing reward to these resets. Accepting edges are given a reward of +1. Positional policies that maximize the average reward also maximize the probability of satisfaction of the objective.

– Reward on accept. This reward scheme was proposed in [35]. The translation of [35] picks a pair in a Rabin automaton to satisfy, and gives positive and negative rewards for the good and bad states of the pair, respectively. In general, picking the winning pair ahead of time is not possible [11]. For a Büchi automaton, this corresponds to giving positive (+1) rewards for accepting edges and zero rewards otherwise. While this reward scheme was shown not to be faithful [11] for general objectives, it is included for comparison purposes.
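To make the first of these schemes concrete, the following sketch shows one interaction step under the limit-reachability transformation of [11] as described above: on an accepting edge of the product, the run moves to a reward-1 sink with probability 1 − ζ. All function and variable names are our own illustrative choices, not the tool's interface.

```python
import random

def limit_reachability_step(env_step, is_accepting, zeta, rng):
    """One step of the limit-reachability reward scheme: if the product
    takes an accepting edge, move to the reward-1 sink with probability
    1 - zeta; otherwise give zero reward and continue."""
    edge, nxt = env_step()
    if is_accepting(edge) and rng.random() >= zeta:
        return 1.0, None        # sink reached: one-off +1 reward
    return 0.0, nxt

# A run that takes an accepting edge at every step: the sink is reached
# with probability 1, so the discounted return approaches 1 as gamma -> 1.
rng = random.Random(1)
zeta, gamma = 0.99, 0.99999
total, disc, state = 0.0, 1.0, 0
while state is not None:
    reward, state = limit_reachability_step(lambda: ('acc', 0),
                                            lambda e: True, zeta, rng)
    total += disc * reward
    disc *= gamma
```

Because every accepting step reaches the sink with probability 1 − ζ, a run with infinitely many accepting edges terminates almost surely and collects the single discounted +1 reward.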

## 3 Tool Design

The primary design goal of Mungojerrie is to enable extensibility. To accomplish this, Mungojerrie separates different processing stages as much as possible, so that extensions can reuse other components. We begin by presenting the architecture

Fig. 5. Architecture of Mungojerrie 1.1.

of Mungojerrie. Afterwards, we take a closer look at the novel slim Büchi automata plugin, which is described here in detail for the first time.

Architecture of Mungojerrie. Mungojerrie begins its execution by parsing the input PRISM and HOA (see the upper part of Fig. 5). The HOA is either read from a file or piped from a call to one of the supported LTL translators. In particular, the ePMC plugin from [13], an LTL translator capable of producing slim Büchi automata, is packaged with the tool. Requested automaton modifications, such as determinization, are run after this step. If specified, Mungojerrie creates the synchronous product between the automaton and the model, and runs model checking or game solving [1,15,16]. The requested strategy and values are returned. For this step, Mungojerrie has been connected to external linear program solvers. This enabled the extension of Mungojerrie to compute reward-maximizing policies via a linear program for branching Markov decision processes in [18].

If learning has been specified, the interpreter takes the automaton and the model, without explicitly forming the product, and provides an interface akin to OpenAI Gym [4] for the RL agent to interact with the environment and receive rewards. When learning is complete, the Q-table(s) can be saved to a file for later use, and the interpreter forms the Markov chain induced by the learned strategy and passes it to the internal model checker for verification.

Fig. 6. Automata generation block diagram

Slim Büchi Automata Generation. For reward schemes involving LTL, the translation to ω-regular automata is an important part of the design. Certain automata may be more effective for learning than others. Slim Büchi automata [13] were designed with learning considerations in mind. The translator that

produces these automata is packaged with Mungojerrie. We now describe its design in detail for the first time.

We have implemented slim Büchi automata generation as a plugin of the probabilistic model checker ePMC [17]. The process is described in Fig. 6. The starting point is a transition-labeled Büchi automaton in the HOA format [2] (2) or an LTL formula (1). If we are given an automaton in HOA format, we parse this automaton (4); if we are given an LTL formula, we use the tool Spot [7] to transform the formula into an automaton (3). In both cases, we end up with a transition-labeled Büchi automaton (5).

Afterwards, we have two options. The first option is to transform (6) this automaton into a slim Büchi automaton (8) [13]. These automata can then be directly composed with MDPs for model checking or used to produce rewards for learning. The other option is to construct (7) a suitable limit-deterministic Büchi automaton (SLDBA) (9). Automata of this type consist of an initial part and a final part. A nondeterministic choice occurs only when moving from the initial part to the final part by an ε transition (a transition without reading a character). SLDBA can be directly composed with MDPs. However, SLDBA directly constructed from general Büchi automata are often quite large, which in turn means that the product with an MDP would be quite large as well. Therefore, we have implemented further optimization steps. We can apply a number of algorithms to minimize (10) this automaton so as to obtain a smaller SLDBA (11). To do so, we implemented several methods:


– If we have a state s in the initial part for which we find a state s′ in the final part such that the languages of s and s′ are the same, we can remove all transitions of s and add an ε transition from s to s′ instead. Afterwards, automaton states that can no longer be reached can be removed.

Each of these methods has a different potential for minimization, as well as a different runtime. We therefore allow the user to specify which optimizations are to be used and in which order they are applied.
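A minimal sketch of the language-equivalence merge described above, assuming a language-equivalence oracle `equiv` is available (deciding equivalence is a separate problem); the automaton encoding and all names are our own, not the plugin's data structures.

```python
def merge_equivalent(delta, eps, initial_part, final_part, equiv, init):
    """SLDBA shrinking step: for an initial-part state s with a language-
    equivalent final-part state t, drop s's ordinary transitions and add
    a single epsilon transition s -> t; then prune unreachable states.
    delta: state -> list of (letter, successor); eps: state -> set."""
    for s in initial_part:
        for t in final_part:
            if equiv(s, t):
                delta[s] = []          # remove ordinary transitions of s
                eps[s] = {t}           # single epsilon jump instead
                break
    # prune states no longer reachable from the initial state
    reach, stack = {init}, [init]
    while stack:
        u = stack.pop()
        for _, v in delta.get(u, []):
            if v not in reach:
                reach.add(v)
                stack.append(v)
        for v in eps.get(u, ()):
            if v not in reach:
                reach.add(v)
                stack.append(v)
    return ({s: ts for s, ts in delta.items() if s in reach},
            {s: ts for s, ts in eps.items() if s in reach})

# Toy automaton: q0, q1 in the initial part, f0 in the final part.
delta = {'q0': [('a', 'q1')], 'q1': [('b', 'q1')], 'f0': [('b', 'f0')]}
eps = {}
# Pretend an oracle established L(q0) = L(f0).
delta2, eps2 = merge_equivalent(delta, eps, ['q0', 'q1'], ['f0'],
                                lambda s, t: (s, t) == ('q0', 'f0'), 'q0')
# q1 is now unreachable and pruned; q0 keeps only the epsilon jump to f0.
```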

Once we have optimized the SLDBA, we could directly use it for later composition with an MDP. Another possibility is to prove that the original automaton is already good for MDPs. If this is the case, then it is often preferable to use the original automaton: being constructed by specialized tools such as Spot, it is often smaller than the minimized SLDBA. The original automaton is good-for-MDPs if it simulates the SLDBA [13]. If it does, then it is also composable with MDPs. Otherwise, it is unknown whether it is suitable for MDPs. In this case, sometimes more complex notions of simulation can be used, but existing decision procedures are too expensive to implement [36].

To show simulation, we construct (12) a simulation game, which in our case is a transition-labeled parity game (13) with 3 colors. We solve these games using (a slight variation of) the McNaughton algorithm [28]. (We are aware that specialized algorithms for parity games with 3 colors exist [9]. However, so far the construction of the arena, not the solving of the game, has turned out to be the bottleneck.) If the even player wins, the simulation holds. Otherwise, more complex notions of simulation can be used, which however lead to larger parity games being constructed. If the even player wins any of them, we can use the original automaton; otherwise we have to use the SLDBA. In any case, we export the result to an HOA file (15). For illustration and debugging, automata and simulation games can be exported to the GraphViz format [8].
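As an illustration of the game-solving step, here is a compact recursive formulation in the McNaughton/Zielonka style (our own didactic version, not the tool's implementation). Player 0 plays the role of the even player, and the game is assumed total (every node has a successor).

```python
def attractor(nodes, edges, owner, target, player):
    """Nodes in `nodes` from which `player` can force the play into `target`."""
    attr = set(target)
    changed = True
    while changed:
        changed = False
        for v in nodes:
            if v in attr:
                continue
            succ = [u for u in edges[v] if u in nodes]
            if owner[v] == player:
                pull = any(u in attr for u in succ)   # player picks a good edge
            else:
                pull = bool(succ) and all(u in attr for u in succ)  # no escape
            if pull:
                attr.add(v)
                changed = True
    return attr

def zielonka(nodes, edges, owner, priority):
    """Winning regions (win_even, win_odd) of a max-priority parity game:
    player 0 (the even player) wins a play iff the maximum priority
    occurring infinitely often is even."""
    if not nodes:
        return set(), set()
    p = max(priority[v] for v in nodes)
    player = p % 2                      # the player favoured by priority p
    a = attractor(nodes, edges, owner,
                  {v for v in nodes if priority[v] == p}, player)
    sub = zielonka(nodes - a, edges, owner, priority)
    if not sub[1 - player]:             # opponent wins nothing in the subgame
        return (nodes, set()) if player == 0 else (set(), nodes)
    b = attractor(nodes, edges, owner, sub[1 - player], 1 - player)
    w0, w1 = zielonka(nodes - b, edges, owner, priority)
    return (w0, w1 | b) if player == 0 else (w0 | b, w1)

# Two self-loops: 'A' sees priority 1 forever (odd), 'B' sees 2 forever (even).
nodes = {'A', 'B'}
edges = {'A': ['A'], 'B': ['B']}
owner = {'A': 0, 'B': 0}
priority = {'A': 1, 'B': 2}
w_even, w_odd = zielonka(nodes, edges, owner, priority)  # {'B'}, {'A'}
```

For the three-color games produced by the simulation-game construction, the recursion stays correspondingly shallow.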

## 4 Case Studies

To showcase how Mungojerrie can be used to experiment with different reward schemes, we provide three case studies. In the first case study, we demonstrate how Mungojerrie can be used to compare the effectiveness of two different reward schemes on the same system. In the second case study, we consider the design space of automata, and demonstrate how Mungojerrie can be used to compare how different ω-automata change learning effectiveness. This is important when considering how to design LTL translators that produce automata that are effective for learning. In the last case study, we demonstrate how the different outputs of Mungojerrie can be used. For additional experimental results obtained with Mungojerrie, we refer readers to [11,12,14,19,39,45,23] for case studies testing ω-regular reward schemes, and to [13] for the ePMC plugin. We also refer readers to [26, Fig. 3], which examined RL for scLTL properties, to [6] for continuous-time MDPs, and to [18], which extended Mungojerrie to test model-free reinforcement learning in branching Markov decision processes.

#### 4.1 Comparing Reward Schemes

To demonstrate how Mungojerrie may be used to compare reward schemes, we compare the reward scheme of [11] with a modification of it that assigns a +1 reward on every accepting edge, as introduced in [12]. We compare these two methods on the same problem, where the learner must safely navigate two robots across a slippery gridworld to a goal. We fix the problem parameters ζ = 0.99 and γ = 0.99999, and the use of Q-learning. Since we are interested in which method converges sooner, we fix the amount of training to be relatively low. We allow the two parameters specific to Q-learning, the learning rate α and the exploration rate ε, to be varied in order to find the optimal combination for each method. We average 10 runs for each grid point. This required 32000 runs, which took approximately 79 CPU hours (single-core) on a 2.5 GHz Intel Xeon E5-2680 v3. This corresponds to an average of approximately 188000 sampled transitions per second per core, including model checking time. This sampling rate is typical of what was observed in other experiments.

Figure 7 shows the probability of satisfaction of the learned strategy as computed by the model checker of Mungojerrie. One can see that under these conditions, the reward scheme from [12] is able to consistently learn probability

Fig. 7. Probability of satisfaction of learned strategies as computed by the model checker of Mungojerrie. 'Hahn et al. 19' refers to the translation of [11]. 'Hahn et al. 20' refers to the translation of [12] that assigns +1 reward on every accepting edge with reachability parameter ζ. Each grid point is the average of 10 runs.

1 strategies under certain parameter combinations, while [11] does not. Figure 8 shows the difference between the estimated probability of satisfaction, obtained by taking the value of the initial state of the Q-table and renormalizing it appropriately, and the probability of satisfaction of the learned strategy computed by the model checker of Mungojerrie. One can see that the reward scheme of [11] sometimes overestimates and sometimes underestimates when it achieves a high actual probability of satisfaction under these conditions. On the same example, however, the reward scheme of [12] consistently underestimates everywhere. In summary, Mungojerrie allowed us to see that, although the reachability reward scheme of [12] may achieve higher probabilities of satisfaction sooner, it may take longer for the values in the Q-table to properly converge.

#### 4.2 Comparing Automata

An ω-regular objective may be described by different automata, many of which may be good-for-MDPs. Mungojerrie can be used to compare the effectiveness of such automata when used in RL. Consider the two nondeterministic Büchi automata shown in Fig. 9. Both are equivalent to the LTL formula (F G x) ∨ (G F y), but the one on the right should be better for learning: long transient sequences of observations that satisfy x ∧ ¬y may convince the agent to commit to State 1 of the left automaton too soon.

To test this conjecture, we specified a model in PRISM organized in two long chains. In one of them the agent sees many xs for a while, but eventually only sees ys. In the other chain the situation is reversed. Which chain is followed is up to chance. We then used the reward scheme from [3] with Q-learning under the default hyperparameters in Mungojerrie, γ<sub>B</sub> = 0.99, γ = 0.99999, α = 0.1, and ε = 0.1. We trained for 20000 episodes with each automaton, and used Mungojerrie to compute the probability of satisfaction of the property at periodic

Fig. 8. Estimated probability of satisfaction of learned strategies minus the probability of satisfaction computed by the model checker of Mungojerrie. Blue indicates underestimation, while red indicates overestimation. Hahn et al. 19 refers to the translation of [11]. Hahn et al. 20 refers to the translation of [12] that assigns +1 reward on every accepting edge with reachability parameter ζ. Each grid point is the average of 10 runs.

intervals. Since learning to control the left automaton requires thorough and deep exploration, we conjectured that optimistic initialization of the Q-table [41] to the value 0.8 would improve performance. We took the average of 1000 runs for each combination.

Figure 10 shows the resulting curves. When using the LDBA without optimistic initialization, the learning agent is unable to learn the optimal strategy under these conditions. While it is worth noting that using the LDBA without optimistic initialization eventually converges to the optimal strategy with enough training, it is clear that the choice of automaton can have a significant impact on learning performance. Therefore, the design of translations from LTL to automata has a role to play in producing effective reward schemes.

Fig. 9. Equivalent, but not equally effective, Büchi automata. "LDBA" and "Forgiving" refer to the automata on the left and right, respectively.

Fig. 10. Plot of the evolution of the probability of satisfaction of learned strategies as computed by the model checker of Mungojerrie. "LDBA" and "Forgiving" refer to the left and right automata of Fig. 9, respectively. "(optimistic)" indicates that optimistic initialization of the Q-table was used. Each curve is the average of 1000 runs.

Fig. 11. A grid-world stochastic game arena (left) and a deterministic parity automaton for the objective (right).

#### 4.3 A Game of Pursuit

Figure 11 describes a stochastic parity game of pursuit in which the Max player (M) tries to escape from the Min player (m). At each round, each player in turn chooses a direction to move. If movement in that direction is not obstructed by a wall, the player moves either two squares or one square with equal probability. One square of the grid is a trap, which m must avoid at all times, but which M may visit finitely many times. Player M should be at least 5 squares away from player m infinitely often. This objective is described by the LTL property (F ¬trap<sub>mn</sub>) ∨ ((F G ¬trap<sub>mx</sub>) ∧ (G F ¬close)), where trap<sub>mn</sub> and trap<sub>mx</sub> are true when m and M visit the trap square, respectively, and close is true when the Manhattan distance between the two players is less than 5 squares. This objective translates to the deterministic parity automaton in Fig. 11, which accepts a word if the maximum recurring priority of its run is odd.
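The max-odd acceptance check can be stated in a few lines for a lasso-shaped run (an illustrative helper with names of our own choosing):

```python
def parity_accepts(run_priorities, loop_start):
    """Max-odd parity acceptance for a lasso-shaped run: the priorities
    from position loop_start onward repeat forever, so exactly those
    priorities recur; the run accepts iff their maximum is odd."""
    return max(run_priorities[loop_start:]) % 2 == 1

# Transient 2, 2, then the cycle 0, 1 repeats: max recurring priority is 1 (odd).
assert parity_accepts([2, 2, 0, 1], loop_start=2)
# Cycle 0, 2: max recurring priority is 2 (even), so the run is rejected.
assert not parity_accepts([2, 1, 0, 2], loop_start=2)
```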

Unlike the example of Fig. 2, inspection of the Markov chain induced by an optimal strategy and manual verification of the optimality of the learned

Fig. 12. Max player learned strategy for the game of Fig. 11 when the automaton is in State 0. (Any strategy will do when the automaton is in State 1.) In each 6 × 6 box the rose-colored square is the position of the minimizing player, while the light-blue square marks the trap.

strategy is impractical. Instead, the model checker of Mungojerrie has verified the optimality of this strategy from the initial state. For visualization, Mungojerrie can also save the strategy in CSV format. Postprocessing can then produce a graphical representation like the one of Fig. 12. The color gradient shows that, in the main, M's strategy is to move away from m.

## 5 Conclusion

We have introduced Mungojerrie, an extensible tool for experimenting with reward schemes for RL, with a focus on ω-regular objectives. Mungojerrie allows the specification of models in PRISM [25] and of ω-automata in HOA [2]. Multiple LTL translators can be called from the tool [7,24], including the ePMC plugin introduced in [13] for the construction of slim Büchi automata. Mungojerrie includes various reward schemes [11,3,12,14,19,23,35] for ω-regular objectives and model-free RL algorithms [43,20,40,23]. Mungojerrie also includes an internal probabilistic model checker for the verification of learned strategies against ω-regular objectives and for allowing users to verify that developed examples are as intended. The tool also comes packaged with benchmarks for ω-regular objectives in RL.

We have discussed Mungojerrie's design and demonstrated how Mungojerrie can be used to perform comparisons of reward schemes for ω-regular objectives. The source and documentation of Mungojerrie are publicly available.

## References


ICCPS 2020, Sydney, Australia, April 21-25, 2020. pp. 98–107. IEEE (2020). https://doi.org/10.1109/ICCPS48487.2020.00017,


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/ 4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Verification**

## **A Formal CHERI-C Semantics for Verification**

## Seung Hoon Park() , Rekha Pai , and Tom Melham

Department of Computer Science, University of Oxford, Oxford, UK {seunghoon.park,rekha.pai,tom.melham}@cs.ox.ac.uk

**Abstract.** CHERI-C extends the C programming language by adding *hardware capabilities*, ensuring a certain degree of memory safety while remaining efficient. Capabilities can also be employed for higher-level security measures, such as software compartmentalization, that have to be used correctly to achieve the desired security guarantees. As the extension changes the semantics of C, new theories and tooling are required to reason about CHERI-C code and verify correctness. In this work, we present a formal memory model that provides a memory semantics for CHERI-C programs. We present a generalised theory with rich properties suitable for verification and potentially other types of analyses. Our theory is backed by an Isabelle/HOL formalisation that also generates an OCaml executable instance of the memory model. The verified and extracted code is then used to instantiate the parametric *Gillian* program analysis framework, with which we can perform concrete execution of CHERI-C programs. The tool can run a CHERI-C test suite, demonstrating its correctness, and can catch a good class of safety violations that the CHERI hardware might miss.

**Keywords:** CHERI-C · Hardware Capabilities · Memory Model · Semantics · Theorem Proving · Verification

## **1 Introduction**

Despite having been developed more than 40 years ago, C remains a widely used programming language owing to its efficiency, portability, and suitability for low-level systems code. The language's lack of inherent memory safety, however, has been the source of many serious issues [18]. While there have been significant efforts aimed at vulnerability mitigation, memory safety issues remain widespread, with a recent study stating that 70% of security vulnerabilities are caused by memory safety issues [31].

The Capability Hardware Enhanced RISC Instructions (CHERI) project offers an alternative model that provides better memory safety [44]. Its main features include a new machine representation of C pointers called *capabilities* and extensions to existing Instruction Set Architectures (ISAs) that enable the secure manipulation of capabilities. Capabilities are in essence memory addresses bound to additional safety-related metadata, such as access permissions and bounds on the memory locations that can be accessed. As the hardware performs the safety checks on capabilities, legacy C programs compiled and run

(a) CHERI-256 Capability Layout

(b) CHERI-128 Capability Layout

Fig. 1: Simplified CHERI Capability Layouts

on the CHERI architecture, i.e. CHERI-C code, acquire hardware-ensured spatial memory safety while retaining efficiency. Porting code from one language to another generally requires significant effort, but porting C code to CHERI-C requires few, if any, changes to the original code to ensure that it runs on CHERI hardware [36, 39].

In 2019, the UK announced its *Digital Security by Design* programme with £190 million of funding distributed over more than 26 research projects and 5 industrial demonstrators [6] to 'radically update the foundation of our insecure digital computing infrastructure, by demonstrating that mainstream processor technology . . . can be updated to include new security technologies based on the CHERI Architecture' [5]. A cornerstone of the programme is Morello [4], a CHERI-enabled prototype developed by Arm.

Over the several years that led to the realisation of Morello, several design revisions were made to the hardware; examples are depicted in Fig. 1. The refined designs used methods for compressing bounds that reduced cache footprints and improved overall performance while minimising incompatibility. Morello uses a design very similar to the compressed scheme for capabilities depicted in Fig. 1b, with the overall bit representation of the layout differing slightly. Future capability designs may incorporate a different bit-representation design, provided there are improvements in performance or compatibility. Due to the ever-changing design of capability bit representations, it seems best to have an *abstract* representation of capabilities, so that CHERI-based verification tools can remain modular.

Checking for memory safety issues of legacy C code can, of course, be achieved using existing analysis tools for C, but there are new problems that arise when such code is run on CHERI hardware. Because the pointer and memory representations are fundamentally different in a CHERI architecture, there are non-trivial differences in the semantics between C and CHERI-C.

To illustrate this point, consider the C code in Listing 1.1. This code segment performs memcpy twice: once from a to b, where pointers/capabilities are stored misaligned in b, then from b to c, where pointers/capabilities are stored correctly again in c. In standard C, there are no problems accessing the pointer stored in c. But in CHERI-C, misaligned capabilities in memory are invalidated. That means the address and metadata of the misaligned capabilities remain accessible, but such capabilities can no longer be dereferenced [41]. While c will contain the same capability value as that of a, the capability stored in c is invalidated. Thus, the last line will trigger an 'invalid tag' exception when the code is executed on ARM Morello and other CHERI-based machines.

```
#include <stdlib.h>
#include <string.h>
void main(void) {
    int *n = calloc(sizeof(int), 1);
    int **a = malloc(sizeof(int *));
    *a = n;
    int **b = malloc(sizeof(int *) * 2);
    int **c = malloc(sizeof(int *));
    memcpy((char *) b + 1, a, sizeof(int *));
    memcpy(c, (char *) b + 1, sizeof(int *));
    int x = **c;
}
```
#### Listing 1.1: C code example

Of course, existing C analysis tools cannot catch these cases: not only are such tools unaware of the changes in semantics that capabilities bring, but the code is also not problematic in conventional C. Moreover, while CHERI ensures spatial safety in hardware, it is still incapable of catching temporal safety violations, such as Use After Free (UAF) violations. There exist works that attempt to address temporal safety [11, 17, 42], but they are either software-implemented solutions [42], where overall performance is inevitably affected, or ongoing work [11]. There is, therefore, a need for program analysis tools that correctly integrate the semantics of CHERI-C.

To the best of our knowledge, there is no prior work on formalising a CHERI-C memory model. The Cerberus C work [30] is primarily designed to capture pointer provenance of C programs and uses CHERI-C as a reference for pointer provenance, but the tool lacks a formal CHERI-C memory model. ESBMC is a verification tool that supports CHERI-C code [15], but support for tagged memory does not yet exist; ESBMC would not be able to catch the 'invalid tag' exception in the code in Listing 1.1. Furthermore, ESBMC's memory model is not formally verified: users of ESBMC must trust that the implementation of the memory model and its underlying theory are correct. SAIL formalisations for each CHERI architecture exist [3, 8, 9], but they only capture the low-level semantics of the architecture and not high-level C constructs such as allocation.

In this paper, we introduce a formal CHERI-C memory model that captures the memory semantics of the CHERI-C language. In Sect. 3, we formalise the memory and its operations and prove essential properties that provide correctness guarantees. We give a rigorous logical formalisation of the CHERI-C memory model in Isabelle/HOL [32] (Sect. 4.1) and use the code-generation feature to generate a verified OCaml instance of the memory model [21]. We then show, in Sect. 4.2, the practical aspects of this work by providing the memory model to, and thereby instantiating, Gillian [20], a general, parametric verification framework that supports concrete and symbolic execution and verification based on separation logic, backed by rich correctness properties. In Sect. 5, we demonstrate that the tool captures the semantics of CHERI-C programs correctly. A discussion of existing work can be found in Sect. 6, while Sect. 7 concludes the paper with possible future directions. We first start with an introduction to the CHERI architecture.

## **2 CHERI**

CHERI extends a conventional ISA by introducing *capabilities*, which are essentially pointers that come with metadata restricting memory access. The ISA gains additional hardware instructions and exceptions that operate over capabilities. Register sets are extended to include capability registers, instructions are added that reference the capability registers, and custom hardware exceptions are added to block operations that would violate memory safety. Designs of CHERI capabilities have been refined over the past several years and have been incorporated into several existing architectures, such as MIPS and RISC-V [40]. All CHERI-extended ISAs have been formally defined using the SAIL specification language, in which the logic of machine instructions and the memory layout are defined formally in a first-order language [13].

Regardless of the layout, CHERI capabilities include three important types of high-level information, in addition to a 64-bit address:


Fig. 1a shows a 256-bit representation of a capability, which was one of the earlier designs. The lower and upper bounds are represented using the base and length fields: the lower bound is the address stated by the base field, and the upper bound is the address in the base field plus the length field. Permissions and other metadata are stored in the remaining fields as a bit vector. The capability's tag bit is held separately from the capability itself. Tag bits are, in practice, stored separately from the main memory where capabilities reside, so users cannot manipulate the tag bits of capabilities stored in memory. Furthermore, overwriting capabilities stored in memory with non-capability values invalidates their tag bits, which ensures capabilities cannot be forged out of thin air.

This representation, in theory, exercises a high level of compatibility with existing C code. But performance, particularly with regard to caching, is reduced due to the size of the capability representation [43]. Refined designs ultimately resulted in a capability that utilises a floating-point-based lossy compression technique on the bounds [43], such as the one depicted in Fig. 1b. In many cases, the upper bits of the address field are likely to overlap with those of the lower and upper bounds. Knowing this, bounds can be compressed by having the upper bits of their fields depend on those of the address, which means only the lower bits need to be stored.

The lossy compression of bounds may result in some incompatibility. Bounds may no longer be represented exactly, and changes in the address field may result in an unintentional change in the bounds. Nonetheless, such representations give an acceptable level of compatibility, provided aggressive pointer arithmetic optimisations are avoided. The Morello processor incorporates a similar compression-based design in its architecture, though sizes of each field differ [12].

The added capability-aware instructions operate over capabilities. Conventional load and store operations are extended to first check that the tag, permissions, and bounds of the capability are all valid. Violations result in triggering a capability-related hardware exception. There are additional operations to access or change the tag, permissions, and bounds. To ensure spatial memory safety, these operations can, at most, make the conditions for execution more restrictive; they cannot grant that which was not previously available. For instance, one cannot lower the lower bound of a capability to access a region that was inaccessible before, or grant a store permission that was unset beforehand. Because of how tags work for capabilities stored in memory, one cannot grant capabilities larger bounds or more permissions by manipulating the memory—attempting this results in tag invalidation.
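The restriction-only discipline described above can be sketched in a few lines of Python. This is an illustrative model, not the paper's formalisation: the names `Cap`, `set_bounds`, and `clear_perm` are ours, and widening is modelled as tag invalidation for simplicity (a real CHERI ISA may instead raise a hardware exception at the offending instruction).

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Cap:
    base: int          # lower bound
    top: int           # upper bound (exclusive)
    perms: frozenset   # e.g. frozenset({"load", "store"})
    addr: int
    tag: bool          # validity tag

def set_bounds(c: Cap, new_base: int, new_top: int) -> Cap:
    # Narrowing keeps the tag valid; any attempt to widen invalidates it.
    narrowing = c.base <= new_base and new_top <= c.top
    return replace(c, base=new_base, top=new_top, tag=c.tag and narrowing)

def clear_perm(c: Cap, p: str) -> Cap:
    # Permissions can only be dropped, never granted, so the tag survives.
    return replace(c, perms=c.perms - {p})

c = Cap(base=0, top=64, perms=frozenset({"load", "store"}), addr=0, tag=True)
assert set_bounds(c, 8, 32).tag            # narrowing: still valid
assert not set_bounds(c, 0, 128).tag       # widening: tag invalidated
assert "store" not in clear_perm(c, "store").perms
```

The key invariant is that no sequence of `set_bounds` and `clear_perm` calls can produce a tagged capability with larger bounds or more permissions than the original.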

Library support for CHERI has grown over the past few years. In particular, a software stack for CHERI-C that utilises a custom Clang compiler now exists [41]. Users can compile their program either in 'purecap' mode, where all pointers in programs are replaced with capabilities, or in 'hybrid' mode, where both pointers and capabilities co-exist within the program. Because operations that change the fields of a capability do not generally exist in standard C, Clang incorporates additional CHERI libraries of operations that users may use to access or mutate capabilities.

## **3 CHERI-C Memory Model**

Incorporating hardware-enabled spatial safety requires significant changes to the C memory model. Pointer designs must be extended to incorporate bounds, metadata, and the out-of-band tag bit. The memory, i.e. heap, must also be able to distinguish the main memory and the tagged memory. Operations with respect to the heap must also be defined such that tag preservation and invalidation are incorporated appropriately.

In this section, we provide a generalised theory for the CHERI-C memory model. We identify the type and value system used by the memory model. We then define the heap and the core memory operations. Finally, we state some essential properties of the heap and the operations that (1) characterise the semantics and (2) state what types of verification or analyses can be supported. We assume a 'purecap' environment, where all pointers have been replaced with capabilities.

#### **3.1 Design**

The CHERI-C memory model is inspired by that of CompCert [26]. CompCert is a verified C compiler: its internal components, which include the block-offset based memory model, are formalised in a theorem prover, with many of their essential properties verified. Using CompCert's memory model as a basis, we design the CHERI-C memory model by providing extensions that ensure correct semantics are modelled and safety violations are captured:


#### **3.2 Type and Value System**

Figure 2 shows the formalisation of CHERI-C types and values. Types τ are analogous to chunks in CompCert terms. Types comprise primitive types (e.g. U8_τ, S64_τ, etc.) and a capability type Cap_τ.

$$
\begin{aligned}
\tau &::= \mathrm{U8}_{\tau} \mid \mathrm{S8}_{\tau} \mid \ldots \mid \mathrm{U64}_{\tau} \mid \mathrm{S64}_{\tau} \mid \mathrm{Cap}_{\tau} \\
\mathit{MCap} &::= \mathcal{B} \times \mathbb{Z} \times \mathit{md} \qquad\quad \mathit{Cap} ::= \mathit{MCap} \times \mathbb{B} \\
\mathcal{V}_{\mathcal{C}} &::= \mathrm{U8}_{V} :: \text{8 bits} \mid \ldots \mid \mathrm{S64}_{V} :: \text{64 sbits} \mid \mathrm{Cap}_{V} :: \mathit{Cap} \mid \mathrm{CapF}_{V} :: \mathit{Cap} \times \mathbb{N} \mid \mathit{Undef} \\
\mathcal{V}_{\mathcal{M}} &::= \mathit{Byte} :: \text{8 bits} \mid \mathit{MCapF} :: \mathit{MCap} \times \mathbb{N}
\end{aligned}
$$

> Fig. 2: CHERI-C Types and Values

We define a function |·| : τ → ℕ that returns the size of the type in bytes. For Cap_τ, the size is not fixed, but it must be divisible by 16. This requirement allows capabilities with 128-bit and 256-bit representations to have a valid size.
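As a concrete reading of the size function |·|, here is a small sketch assuming a 128-bit (16-byte) capability; the names `size_of` and `CAP_SIZE` are ours, not the formalisation's.

```python
CAP_SIZE = 16  # bytes; 32 would model the 256-bit representation

def size_of(t: str) -> int:
    # |·| : τ → ℕ, the size of a type in bytes.
    prim = {"U8": 1, "S8": 1, "U16": 2, "S16": 2,
            "U32": 4, "S32": 4, "U64": 8, "S64": 8}
    if t == "Cap":
        # The capability size is not fixed, but must be divisible by 16.
        assert CAP_SIZE % 16 == 0, "capability size must be divisible by 16"
        return CAP_SIZE
    return prim[t]

assert size_of("U64") == 8
assert size_of("Cap") % 16 == 0
```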

MCap represents a *memory capability* value and is represented as a tuple (b, i, m), which comprises the block identifier b ∈ B, offset i ∈ Z, and metadata m ∈ md, where md represents the bounds and permissions. Here, B must be a countable set. Offsets are represented as integers, as CHERI allows out-of-bounds addresses, where the address may be lower than the lower bound. Because capabilities stored in memory have their tag bit stored elsewhere, we make the distinction between memory capabilities and *tagged capabilities*, *Cap*, where a tagged capability ((b, i, m), t) additionally contains the tag bit t ∈ 𝔹.

Unlike those of CompCert, CHERI-C values V_C are given type distinctions to ensure that: (1) types can be inferred directly, and (2) they contain the correct values at all times. From a practical standpoint, this simplifies the proof of correctness of the memory operations and allows bounded arithmetic operations to be implemented correctly. Capability values Cap_V and capability fragment values CapF_V also exist as values. Given some capability value C ∈ Cap_V, the capability fragment value C_n ∈ CapF_V corresponds to the n-th byte of the capability C. In both cases, instead of fixing their representation concretely, we represent them abstractly using a tuple. This representation ensures that conversion to a compressed representation can be achieved when needed, while avoiding the need to fix one particular bit representation. Furthermore, this approach provides a reasonable way to correctly define memcpy, where capability tags must be preserved if possible. While capability fragments are extended structures of capabilities, the operations that can be performed on them are limited. Finally, we have *Undef*, which represents invalid values. These values may appear when, for example, the user calls malloc and immediately tries to load the undefined contents. The idea of incorporating capability fragment values is heavily inspired by [25].

Because values are given a type distinction, identifying the types of values is straightforward. For capability fragments, we have two choices: they may be of either U8_τ or S8_τ type. Capability fragments are essentially bytes, so operations over them can be treated as if they were of U8_τ or S8_τ type. Since *Undef* does not correspond to a valid value, it is not assigned a type.

$$
\begin{aligned}
\mathit{CapErr} &::= \mathrm{TagViolation} \mid \mathrm{PermitLoadViolation} \mid \ldots \\
\mathit{LogicErr} &::= \mathrm{UseAfterFree} \mid \mathrm{MissingResource} \mid \ldots \\
\mathit{Err} &::= \mathit{CapErr} \mid \mathit{LogicErr} \\
\mathcal{R}\ \rho &::= \mathrm{Succ}\ \rho \mid \mathrm{Fail}\ \mathit{Err}
\end{aligned}
$$

> Fig. 3: CHERI-C Errors

Memory operations, such as load and store, are defined so that, upon failure, the operation returns the type of error that led to the failure. In general, partial functions, or functions using the option type, can model failure but cannot state what caused it. As such, the operations use the return type R ρ, where ρ is a generic return type. For CHERI-C, we distinguish between errors caused by capabilities, denoted by CapErr, and errors caused by the language, denoted by LogicErr. Figure 3 depicts the formalised error system used by the memory model.
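The return type R ρ can be sketched as a tagged union. This is a minimal illustration, not the formalisation: the error strings stand in for the CapErr/LogicErr constructors.

```python
from dataclasses import dataclass
from typing import Generic, TypeVar, Union

T = TypeVar("T")

@dataclass(frozen=True)
class Succ(Generic[T]):
    value: T

@dataclass(frozen=True)
class Fail:
    err: str  # e.g. "TagViolation" (CapErr) or "UseAfterFree" (LogicErr)

Result = Union[Succ[T], Fail]

def describe(r) -> str:
    # Unlike a bare option type, a Fail carries the cause of the failure.
    return f"ok: {r.value}" if isinstance(r, Succ) else f"error: {r.err}"

assert describe(Succ(42)) == "ok: 42"
assert describe(Fail("TagViolation")) == "error: TagViolation"
```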

#### **3.3 Memory**

We now formalise the memory. We use CompCert's approach of using a union type V_M that can represent either a byte or a byte fragment of a memory capability. Then it is possible to create a memory mapping N ⇀ V_M.<sup>1</sup> We also create a separate mapping of type N ⇀ 𝔹 for the tagged memory. When the user attempts to store a capability, it is converted into a memory capability and then stored in the memory mapping. Separately, the tag bit is stored in the tagged memory. When the tag bit is stored, adjustments are made to ensure tags are only stored at capability-size-aligned offsets.

To ensure we can catch temporal safety violations, we need to be able to distinguish between blocks that are freed and blocks that are valid. One way to encode this is as follows: a block b may point either to a freed location (i.e. b ↦ ∅) or to the pair of maps we defined earlier. The idea is that if a block identifier points to a freed block, attempts to load from it will trigger a 'Use After Free' violation; otherwise, it points to a valid mapping pair. Ultimately, the heap has the following form:

$$\mathcal{H} : \mathcal{B} \rightharpoonup ((\mathbb{N} \rightharpoonup \mathcal{V}_{\mathcal{M}}) \times (\mathbb{N} \rightharpoonup \mathbb{B}))_{\emptyset}$$
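The heap shape can be sketched as a Python dictionary; the names are ours, and `None` models a freed block b ↦ ∅.

```python
from typing import Dict, Optional, Tuple

Block = int
Memory = Dict[int, object]   # offset ⇀ byte or capability fragment
TagMem = Dict[int, bool]     # capability-aligned offset ⇀ tag bit
Heap = Dict[Block, Optional[Tuple[Memory, TagMem]]]

def is_freed(h: Heap, b: Block) -> bool:
    # A block mapped to None models b ↦ ∅, so use-after-free is
    # detectable, unlike simply deleting the block from the map.
    return b in h and h[b] is None

h: Heap = {1: ({}, {}), 2: None}  # block 1 is live, block 2 was freed
assert not is_freed(h, 1) and is_freed(h, 2)
```

Keeping freed blocks in the map, rather than deleting them, is what lets the model report a 'Use After Free' logical error instead of a generic missing-resource failure.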

#### **3.4 Operations**

We define the core memory operations, or *actions*, of the memory model. We use the result type R given in Fig. 3 instead of a partial function, so that the type of error is reported should an operation fail.

The memory actions A_C = {alloc, free, load, store} are given below with their respective signatures:

- alloc : H → N → R (H × Cap)
- free : H → Cap → R (H × Cap)
- load : H → Cap → τ → R (V_C)
- store : H → Cap → V_C → R (H)

<sup>1</sup> The notation ⇀ denotes a partial map. Offsets in heaps are N, whereas offsets stored in capabilities are Z. Operations check whether the offsets are in bounds, which requires offsets to be non-negative. This means valid offset values can be converted from Z to N without issues.

The function alloc μ n = Succ (μ′, c) takes a heap μ and size n as input and produces a fresh capability c and the updated heap μ′ as output. The bounds of c are determined by n. In the case of compressed capabilities, a sufficiently large n *may* result in the upper bound being larger than what was requested. The capability c is also given the appropriate permissions and a valid tag bit. Like CompCert's, alloc is designed to never fail, provided that the countable set B is infinite.
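A sketch of alloc under these assumptions, with uncompressed bounds and a hypothetical dict encoding of heaps and capabilities (all names ours):

```python
def alloc(heap, next_block, n):
    # Allocate a fresh block with empty memory and tag maps. This never
    # fails while fresh block identifiers remain (B is infinite).
    b = next_block
    heap = {**heap, b: ({}, {})}
    cap = {"block": b, "offset": 0, "base": 0, "top": n,   # bounds from n
           "perms": {"load", "store"}, "tag": True}        # valid tag
    return heap, next_block + 1, cap

h, nb, c = alloc({}, 0, 8)
assert c["tag"] and c["top"] == 8 and h[0] == ({}, {}) and nb == 1
```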

The function free μ c = Succ (μ′, c′) takes a heap μ and capability c = ((b, i, m), t) as input. Upon success, the operation returns the updated heap, in which we now have b ↦ ∅. The returned capability c′ is c with its tag bit invalidated. This conforms to the CHERI-C design stated in [41]. We note that c must also be a valid capability, that is—at the very least—the tag bit must be set, and the offset must be within the capability bounds. The function free may fail if the block is invalid or already freed, even if the capability itself was valid. In such cases, free returns a logical error.
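A sketch of free, modelling the heap as a dict from blocks to content (`None` for freed) and the capability as a dict (hypothetical names; only the tag check is shown, omitting the bounds check for brevity):

```python
def free(heap, cap):
    b = cap["block"]
    if not cap["tag"]:
        return ("Fail", "TagViolation")       # capability error
    if b not in heap or heap[b] is None:
        return ("Fail", "UseAfterFree")       # logical error
    heap = {**heap, b: None}                  # b ↦ ∅
    cap = {**cap, "tag": False}               # returned capability untagged
    return ("Succ", (heap, cap))

h = {1: ({}, {})}
status, (h2, c2) = free(h, {"block": 1, "tag": True})
assert status == "Succ" and h2[1] is None and not c2["tag"]
# A second free on the same block is a logical error, not a capability error.
assert free(h2, {"block": 1, "tag": True}) == ("Fail", "UseAfterFree")
```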

The function load μ c t = Succ v takes a heap μ, capability c, and type t as input, where t is the type the user wants to load. Upon success, the operation returns the value v from the memory, where v has the corresponding type t.<sup>2</sup> Before load attempts to access the block provided by c, it first checks that c has sufficient permissions to load. We use the CHERI-MIPS SAIL implementation of the CL[C] instruction [40] for the capability checks, implementing the extra checks provided that t = Cap_τ. Once the capability checks are done, the operation attempts to access the block and the mappings, failing and returning the appropriate logical error if they do not exist.

When accessing the main memory and the tagged memory, there are a number of cases to consider. When loading primitive values, it is important that the region about to be loaded is entirely of *Byte* type and not of *MCapF* type. Thus, before loading the values, we check whether the contiguous region in memory is all of *Byte* type. If this is not the case, load returns *Undef*. For capability fragments, the cell in memory has to be an *MCapF*. Finally, for capabilities, not only do the contiguous cells have to be of *MCapF* type, but (1) they must have the same memory capability value, and (2) the fragment indices must form the sequence {0, 1, ..., |Cap_τ| − 1}. The idea is that even if the contiguous cells have the same memory capability values, they do not form a valid capability if the fragments are not stored in order. After all the checks, the tagged memory is accessed, where the tag value is retrieved.<sup>3</sup> The loaded memory capability and tag bit are then combined to form a tagged capability, which load returns.
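The capability-reconstruction check that load performs can be sketched as follows, modelling each cell as a ("MCapF", mcap, index) tuple (names and encoding ours):

```python
def valid_cap_region(mem, start, cap_size):
    # All |Cap| cells must be MCapF fragments of the *same* memory
    # capability, with indices forming the sequence 0, 1, ..., |Cap|-1.
    cells = [mem.get(start + k) for k in range(cap_size)]
    if any(c is None or c[0] != "MCapF" for c in cells):
        return False
    mcap = cells[0][1]
    return all(c[1] == mcap and c[2] == k for k, c in enumerate(cells))

good = {i: ("MCapF", "cap0", i) for i in range(4)}
bad = {**good, 1: ("MCapF", "cap0", 2), 2: ("MCapF", "cap0", 1)}
assert valid_cap_region(good, 0, 4)
assert not valid_cap_region(bad, 0, 4)  # same mcap, fragments out of order
```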

The function store μ c v = Succ μ′ takes a heap μ, capability c, and value v. Upon success, the operation returns the updated heap μ′. Like load, store performs the necessary capability checks based on CHERI-MIPS' CS[C] instruction and attempts to access the blocks and mappings afterwards, returning the appropriate exception upon failure. For storing primitive values and capability

<sup>2</sup> For capability fragments, the corresponding type may be either U8_τ or S8_τ. <sup>3</sup> The tagged memory does not need to be accessed if c does not have a capability load permission. In such cases, the loaded capability will have an invalidated tag.

fragment values, the main memory mapping is simply updated to contain the values, and the associated tag entries are invalidated. For primitive values that are not bytes, the values are converted into a sequence of bytes, each of which is stored contiguously in memory. A capability fragment value is stored in its cell as an *MCapF*, with the tag value of the fragment stripped when storing in memory. Finally, a capability value is split into a list of |Cap_τ| memory capability fragments, whose fragment indices form the sequence {0, 1, ..., |Cap_τ| − 1}, and a tag bit. The main memory stores the list of memory fragments contiguously, and the tagged memory stores the tag value at the corresponding capability-aligned location.
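The fragment-splitting behaviour of a capability store can be sketched as follows, again modelling cells as ("MCapF", mcap, index) tuples and assuming a 16-byte capability (names and encoding ours):

```python
CAP_SIZE = 16  # bytes, assuming a 128-bit capability

def store_cap(mem, tags, off, mcap, tag):
    # Write |Cap| ordered fragments to main memory and the tag bit to the
    # capability-aligned slot of the tagged memory.
    assert off % CAP_SIZE == 0, "capability stores must be aligned"
    for k in range(CAP_SIZE):
        mem[off + k] = ("MCapF", mcap, k)   # fragment k of the capability
    tags[off] = tag

mem, tags = {}, {}
store_cap(mem, tags, 16, "cap0", True)
assert mem[16] == ("MCapF", "cap0", 0) and mem[31] == ("MCapF", "cap0", 15)
assert tags[16] is True and len(mem) == CAP_SIZE
```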

#### **3.5 Properties**

In the previous section, we articulated a formal CHERI-C memory model, explaining how the heap is structured and how the operations are defined. It is essential that the formalisation is correct and suitable for verification and other types of analyses. In this section, we first discuss the properties of the memory. We then discuss the properties of the operations themselves, which are primarily concerned with correctness.

When we observe the memory, it is important that we always work with a valid one, i.e. the memory is *well-formed*. In our formalisation, we require that all tags in the tagged memory are stored at a capability-aligned location. The well-formedness relation W_f^C is defined as follows:

$$\mathcal{W}_f^{\mathcal{C}}(\mu) \equiv \forall b \in dom(\mu).\; b \mapsto (c, t) \longrightarrow \forall x \in dom(t).\; x \bmod |\mathrm{Cap}_\tau| = 0$$
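An executable reading of W_f^C, modelling the heap as a map from blocks to either `None` (freed) or a (memory, tags) pair, with 16-byte capabilities assumed (names ours):

```python
CAP_SIZE = 16

def well_formed(heap):
    # Every tag in every live block sits at a capability-aligned offset.
    for content in heap.values():
        if content is None:        # freed block: trivially well-formed
            continue
        _mem, tags = content
        if any(off % CAP_SIZE != 0 for off in tags):
            return False
    return True

assert well_formed({})                         # the empty heap μ0
assert well_formed({1: ({}, {0: True, 16: False}), 2: None})
assert not well_formed({1: ({}, {3: True})})   # misaligned tag
```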

The well-formedness property must hold when the heap is initialised and whenever memory operations mutate the heap. That is, provided μ_0 is the initialised heap in which all mappings are empty, α ∈ A_C is a memory action, v are the arguments of α, and μ′ is one of the return values denoting the updated heap, we have the following properties:

$$\mathcal{W}_f^{\mathcal{C}}(\mu_0)$$

$$\mathcal{W}_f^{\mathcal{C}}(\mu) \Longrightarrow \alpha\ \mu\ v = \mathrm{Succ}\ \mu' \Longrightarrow \mathcal{W}_f^{\mathcal{C}}(\mu')$$

The two properties above ensure that the heap is well-formed throughout the execution of the CHERI-C program.

For the correctness of the operations, we primarily consider soundness and completeness:


The first and second points are simple soundness and completeness properties. The third point is important in that the input may be problematic in many ways. For example, the NULL capability has an invalid tag bit, invalid bounds, and no permissions. The function load will fail if provided with the NULL capability, as it violates many of the checks. Because the SAIL specification states that tags are always checked first, the error must be a TagViolation type.

Next, we need to ensure successive operations yield the desired result. The primary properties to consider are the *good variable* laws [26]; examples of properties encoding this law include *load after allocation*, *load after free*, and *load after store*. There are, however, some caveats. For example, the *load after store* case no longer guarantees that the value retrieved is the same value that was stored, unlike CompCert's load-after-store property in [26], since the value that was stored and is to be loaded again could have been a capability or a capability fragment. In such cases, the tag bit may become invalidated due to insufficient permissions on the capability, or because storing capability fragments cleared the tagged memory. The solution is to divide the general property into a primitive value case and a capability-related value case. Ultimately, the idea is to prove that the loaded value is *correct* rather than exact, i.e. capability-related values, when loaded, will have the correct tag value.

Finally, we have properties suitable for verification. We note that the memory H can be instantiated as a separation algebra by providing the partial commutative monoid (PCM) (H, ⊎, μ_0), where ⊎ is the disjoint union of two heaps and μ_0 is the empty initialised heap. For tools that rely on partial memories, it is also imperative to show that the well-formedness property is compatible with memory composition:

$$\mathcal{W}_f^{\mathcal{C}}(\mu_1 \uplus \mu_2) \Longrightarrow \mathcal{W}_f^{\mathcal{C}}(\mu_1) \land \mathcal{W}_f^{\mathcal{C}}(\mu_2)$$
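Composition and its compatibility with well-formedness can be sketched directly: ⊎ is defined only on heaps with disjoint block identifiers, and well-formedness of the union gives well-formedness of each part. The checker names are hypothetical and the alignment is fixed at 16 bytes.

```python
CAP_SIZE = 16

def well_formed(heap):
    # Tags of every live block must sit at capability-aligned offsets.
    return all(content is None or
               all(off % CAP_SIZE == 0 for off in content[1])
               for content in heap.values())

def disjoint_union(h1, h2):
    # μ1 ⊎ μ2 is defined only when the block identifiers are disjoint.
    assert not (h1.keys() & h2.keys()), "heaps must be disjoint"
    return {**h1, **h2}

h1 = {1: ({}, {0: True})}
h2 = {2: ({}, {16: True})}
u = disjoint_union(h1, h2)
assert well_formed(u) and well_formed(h1) and well_formed(h2)
```

Since ⊎ merges disjoint block maps without touching offsets, a tag that is aligned in the union was already aligned in whichever operand contributed it, which is the intuition behind the compatibility property above.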

We also note that the current heap design keeps track of *negative* resources [28], which may be useful for incorrectness-logic-based verification [33].

## **4 Application**

The memory model provided in Sect. 3 has been designed to be applicable to verification tools. In this section, we explain how we use the theory above to create a verified, executable instance of the memory model. We then explain how this executable model can be used to instantiate a tool called Gillian [20]. Using the instantiated tool, we demonstrate concrete execution of CHERI-C programs with the desired behaviour.

#### **4.1 Isabelle/HOL**

Isabelle/HOL is an interactive theorem prover based on classical Higher-Order Logic (HOL) [32]. We use Isabelle/HOL to formalise the entirety of the CHERI-C memory model discussed in Sect. 3: types, values, and the heap structure were implemented, memory operations were defined, and properties relating to the heap and the operations were proven. Memory capabilities, tagged capabilities, and capability fragments were represented using records, a form of tuple with named fields. For code generation, we instantiated the block type B to be Z. To show that H is an instance of a separation algebra, we use the cancellative_sep_algebra class [23] and prove that the heap model is an instance; this proof ultimately shows that H forms a PCM. Proving that well-formedness is compatible with memory composition is stated slightly differently: the cancellative_sep_algebra class takes a total operator ·_t instead of a partial one and requires a 'separation disjunction' binary operator #, which states disjointness. Ultimately, the compatibility property can be given as:

$$\mu_1 \mathbin{\#} \mu_2 \Longrightarrow \mathcal{W}_f^{\mathcal{C}}(\mu_1 \cdot_t \mu_2) \Longrightarrow \mathcal{W}_f^{\mathcal{C}}(\mu_1) \land \mathcal{W}_f^{\mathcal{C}}(\mu_2)$$

For partial mappings of the form A ⇀ B, we use Isabelle/HOL's finite mapping type ('a,'b) mapping [22]. To obtain an OCaml-executable instance of the memory model, we use the Containers framework [27], which generates a red-black tree implementation from the abstract mapping in Isabelle/HOL. All definitions in Isabelle were either defined to be code-generatable to begin with (i.e. definitions do not comprise quantifiers or non-constructive constants like the Hilbert choice operator SOME), or code equations were provided and proven to ensure sound code generation [21]. For bounded machine words, which are required for formalising the primitive values, we use Isabelle/HOL's word type 'a word, where 'a states the length of the word [14]. Types like 'a word, nat, int, and string were also transformed to use OCaml's Zarith and native string libraries for efficiency [21].

#### **4.2 Gillian**

Gillian is a high-level analysis framework, theoretically capable of analysing a wide range of languages. The framework supports concrete and symbolic execution, verification based on separation logic, and bi-abduction [28]. The crux of the framework lies in its parametricity: the tool can be instantiated simply by providing a compiler front end and OCaml-based memory models for the language. So far, CompCert C and JavaScript have been instantiated for Gillian, giving rise to Gillian-C and Gillian-JS.

The underlying theoretical foundation of Gillian has its essential correctness properties like soundness and completeness already proven [20, 29]. Thus, users who instantiate the tool only need to prove the correctness of the implementation of their compiler and memory models to ensure the correctness of the entire tool. From the perspective of someone trying to instantiate Gillian with their compiler and memory models, it is essential to understand the underlying intermediate language GIL and the overall memory model interface used by Gillian.

**GIL** GIL is the GOTO-based intermediate language used by Gillian for all types of analyses the tool supports. For concrete execution, GIL supports basic GOTO constructs and assertions. For symbolic execution, the GIL grammar is extended to support path cutting, i.e. assumptions, and the generation of symbolic variables. For separation-logic-based verification, the GIL grammar is further extended to support core predicates and user-defined predicates [28] that can be used to form separation logic assertions. Furthermore, function specifications in Hoare-triple form {P} f(x̄) {Q} can be provided, where P and Q are separation logic assertions.

Note that Gillian uses a value set V which differs from that used in the CHERI-C memory model. As we are only interested in the values used in the CHERI-C memory model, it is possible to implement a thin conversion layer between the two value systems. We note that a list of GIL values also constitutes a GIL value, so arguments for functions can be expressed as a single GIL value. This is important when understanding the memory model layout of Gillian.

**Memory Model** Memory Models in Gillian have a specific definition and have properties that state what kind of analysis is supported. Proving that the provided memory models satisfy certain properties is essential in understanding what the instantiated tool supports.

Gillian differentiates between concrete and symbolic memory models, which are used for concrete and symbolic execution, respectively. As we are concerned with concrete execution, we will consider only concrete memory models here.

At the highest level, there are two kinds of memory model properties: *executional* and *compositional*. The *executional* memory model states the properties a memory model must have for whole-program execution, and the *compositional* memory model states the properties a memory model must have for separation-logic-based symbolic verification. Each paper in the Gillian literature states slightly different definitions of the memory models [20, 28, 29, 37]—in Definitions 1 and 2 below, we present unified, consistent definitions of the memory model properties. We ignore contexts, as there exists only one context in concrete memories: the GIL boolean value true.

**Definition 1.** *(Execution Memory Model). Given the set of GIL values* V *and an action set* A*, an execution memory model* M(V, A) ≜ (|M|, W_f, ea) *comprises:*


**Definition 2.** *(Compositional Memory Model). Given the set of GIL values* V *and core predicate set* Γ*, a compositional memory model* M(V, A_Γ) ≜ (|M|, W_f, ea_Γ) *comprises:*


$$\mathcal{W}_f(\mu_1 \cdot \mu_2) \Longrightarrow \mathcal{W}_f(\mu_1) \land \mathcal{W}_f(\mu_2)$$

*3. the predicate action execution function* ea_Γ : A_Γ → |M| → V → R (|M| × V)

First, we note that for concrete execution, Gillian also uses the return type R in the action execution function ea.<sup>4</sup> For W_f defined in Definition 1, the main properties that must be satisfied are Properties 3.1, 3.2, and 3.6 in [29].

The PCM structure is required to show that the heap forms a separation algebra [16]. W_f is extended to state that memory composition must also be well-formed. Finally, the predicate action execution function ea_Γ provides a way to frame on and off parts of the memory, though it is not required for concrete execution, as it is not part of the GIL concrete execution grammar.

Using the CHERI-C memory model defined earlier, we can show that our model conforms to both Definitions 1 and 2. Let A_C be the set of memory actions, H the memory, ea_C the action execution function of the CHERI-C memory model, and W_f^C the well-formedness relation. Then (H, W_f^C, ea_C) forms an execution memory model. We note that Properties 3.1 and 3.2 in [29] are satisfied, and Property 3.6 is trivial in that operations that return errors do not return an updated heap. The memory model also conforms to a compositional memory model, as we have the PCM (H, ⊎, μ_0) along with the well-formedness property being composition-compatible. The predicate action execution function need not be given, as the concrete execution of Gillian does not utilise this feature.

#### **4.3 Compiler**

We implemented a CHERI-C to GIL compiler by utilising ESBMC's GOTO language. ESBMC uses its own intermediate representation for bounded model checking, namely the GOTO language, and CHERI-enabled ESBMC uses Clang as a front end to generate it. In our case, we can build a GOTO to GIL compiler instead of building a CHERI-C compiler from scratch. The GOTO language is very similar to GIL in that both are goto-based languages and use static single assignment. For the most part, the compilation process is straightforward. As ESBMC's GOTO language is typed while the CHERI-C memory model is untyped—untyped in the sense that the memory model does not support user-defined types like structs—we make sure that capability arithmetic and casts are applied correctly by inferring the sizes of the user-defined types.

## **5 Experimental Results**

In Sect. 4, we have provided a way to instantiate the Gillian tool, where we obtain a concrete CHERI-C model using Isabelle/HOL and a CHERI-C to GIL

<sup>4</sup>In the Gillian literature, it is stated that <sup>R</sup> can return both a return value and an error. The OCaml implementation of Gillian slightly differs from this and is more similar to R used for the CHERI-C memory model.

compiler that utilises ESBMC's GOTO language. Our framework can demonstrate that higher-level memory actions—such as memcpy(), which preserves tags when applicable—can be implemented. Furthermore, we can run concrete instances of programs that use memcpy() to show that they exhibit the expected behaviour. This also means the tool can catch the TagViolation exception that is triggered in Listing 1.1. Our tool also makes the capability-related functions defined in cheriintrin.h and cheri.h usable, i.e. it is possible to call operations such as cheri_tag_get() and cheri_tag_clear().


Table 1: Violation detection


Table 2: GCC runtime performance

Table 1 shows a list of safety violations that Gillian-C, our tool, the ARM Morello hardware, and CHERI-ESBMC—labelled GC, GCC, AM, and BMC, respectively—can catch. We observe that Morello fails to catch temporal safety violations such as dangling pointers and double frees. For the invalid free case, where we attempt to free a pointer not produced by malloc, we discovered a bug in the Gillian-C tool that fails to catch this violation.<sup>5</sup> Gillian-C does not return any errors for the program in Listing 1.1, which is to be expected, as the program is not problematic for conventional C. Finally, we observe that CHERI-ESBMC fails to catch the last two violations, which relate to tag invalidation.

Table 2 shows the runtime performance of running the CHERI-C library test suites, based on the Clang CHERI-C test suite [1]. Tests were conducted on a machine running Fedora 34 on an 11th Gen Intel Core i7-1185G7 CPU with 31.1 GB RAM, with trace logging enabled. We note that when the test cases were executed on Morello without any modifications to the code, all of the tests terminated instantaneously without any issues. In the libc_malloc.c test case, we reduced the scope of the test<sup>6</sup> to ensure the tool terminates within a reasonable time, though the performance can be drastically improved by turning logging off, e.g. the libc_malloc.c case then takes only 0.686 seconds. For the remaining tests, we made modifications to the code to ensure the compiler can correctly produce the GIL code, and we made sure to preserve all the edge cases covered by the original tests. For example, in libc_memcpy.c we made sure to test all cases where both src and dst capabilities were aligned and misaligned at the beginning and the end, which affects tag preservation. We observed that no assertions were violated, and we also observed that the same

<sup>5</sup>The bug has since been fixed after a discussion with the developers [7].

<sup>6</sup>In particular, we reduced max from the libc malloc.c case in [1] from 20 to 9.

code when run in Morello also resulted in no assertion violations, demonstrating a faithful implementation of CHERI-C semantics.

## **6 Related Work**

The CompCert C memory model [26], CH2O memory model [24], and Tuch's C memory model [38] are C memory models formalised in a theorem prover, each focusing on different aspects of verification. Our model mostly draws inspiration from these models, extending such work to support CHERI-C programs.

VCC, which internally uses the typed C memory model [19], and CHERI-ESBMC [15] are designed with automated verification of C programs via symbolic execution in mind—in particular, CHERI-ESBMC supports hybrid settings and compressed capabilities in addition to purecap settings and uncompressed capabilities. Both tools rely on a memory model that is not formally verified, so the tools have components that must be trusted.

## **7 Conclusion and Future Work**

We have provided a formal CHERI-C memory model and demonstrated its utility for verification. We formalised the entire theory in Isabelle/HOL and generated an executable instance of the memory model, which was then used to instantiate a CHERI-C tool. The result is a concrete execution tool that is robust in terms of the properties guaranteed both by the tool and by the memory model. We demonstrated its practicality by running CHERI-C based test suites, capturing memory safety violations, and comparing the results against actual CHERI hardware—namely the physical Morello processor.

Currently, the memory model has a number of limitations. Capability arithmetic is limited to addition and subtraction, but the heap can be extended to incorporate mappings from blocks to physical addresses and vice versa, which provides a way to extend capability arithmetic. While the theory incorporates abstract capabilities, support for capability compression is still in progress. We believe, however, that the abstract design itself does not need to change: it may be possible to use existing compression/decompression work [2] to convert between the two forms when needed, whilst retaining our design for the operations.

This theory serves as a starting point for much potential future work. A compositional symbolic memory model can be built from this design to enable symbolic execution and verification in Gillian. As we have already proven the core properties, proving the remaining properties for the extended model will allow automated separation logic based verification of CHERI-C programs.

**Acknowledgements** We are very grateful to the Gillian team, in particular Sacha-Élie Ayoun, for providing assistance with instantiating the Gillian tool. We also thank Fedor Shmarov and Franz Brauße for providing assistance with building and modifying the ESBMC tool. This work was funded by the UKRI programme on Digital Security by Design (Ref. EP/V000225/1, SCorCH [10]).

**Data-Availability Statement** The Isabelle/HOL formalisation of the CHERI-C memory model described in Sect. 4.1 is available in the Isabelle Archive of Formal Proofs [34]. The artefact of the evaluation provided in Sect. 5, which includes Gillian-CHERI-C itself, CHERI-ESBMC, and other tools, is archived in the Zenodo open-access repository [35].

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/ by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## Automated Verification for Real-Time Systems via Implicit Clocks and an Extended Antimirov Algorithm

Yahui Song(✉) and Wei-Ngan Chin

School of Computing, National University of Singapore, Singapore, Singapore {yahuis,chinwn}@comp.nus.edu.sg

Abstract. The correctness of real-time systems depends on both their functionalities and their real-time constraints. To go beyond existing Timed Automata based techniques, we propose a novel solution that integrates a modular Hoare-style forward verifier with a term rewriting system (TRS) on Timed Effects (TimEffs). The main purposes are to increase expressiveness, dynamically manipulate clocks, and efficiently solve clock constraints. We formally define a core language C<sup>t</sup>, generalizing real-time systems modeled using mutable variables and timed behavioral patterns, such as delay, timeout, interrupt, and deadline. Secondly, to capture real-time specifications, we introduce TimEffs, a new effects logic that extends regular expressions with dependent values and arithmetic constraints. Thirdly, the forward verifier reasons about temporal behaviors – expressed in TimEffs – of target C<sup>t</sup> programs. Lastly, we present a purely algebraic TRS, i.e., an extended Antimirov algorithm, to efficiently check language inclusion between TimEffs. To demonstrate the feasibility of our proposal, we prototype the verification system, prove its soundness, and report on case studies and experimental results.

## 1 Introduction

During the last three decades, a popular approach to specifying real-time systems has been based on Timed Automata (TAs) [1]. TAs are powerful for designing real-time models via explicit clocks, where real-time constraints are captured by explicitly setting/resetting clock variables. A number of automatic verification tools for TAs have proven successful [2,3,4,5]. Industrial case studies show that requirements for real-time systems are often structured into phases, which are then composed sequentially, in parallel, or as alternatives [6,7]. TAs lack high-level compositional patterns for hierarchical design; moreover, users often need to manipulate clock variables manually, with carefully calculated clock constraints. The process is tedious and error-prone.

There have been some translation-based approaches to building verification support for compositional timed-process representations. For example, Timed Communicating Sequential Processes (TCSP), Timed Communicating Object-Z (TCOZ), and Statechart-based hierarchical Timed Automata are well suited to presenting compositional models of complex real-time systems. Prior works [8,9] systematically translate TCSP/TCOZ/Statechart models to flat TAs so that the model checker Uppaal [3] can be applied. However, two possible insufficiencies remain: the expressive power is limited by finite-state automata; and there is always a gap between the verified logic and the actual code implementation.

https://doi.org/10.1007/978-3-031-30823-9_29 © The Author(s) 2023 S. Sankaranarayanan and N. Sharygina (Eds.): TACAS 2023, LNCS 13993, pp. 569–587, 2023.

In this work, we investigate an alternative approach to verifying real-time systems. We propose a novel temporal specification language, Timed Effects (TimEffs), which enables compositional verification via a Hoare-style forward verifier and a term rewriting system (TRS). More specifically, we specify system behaviors in the form of TimEffs, which integrate Kleene algebra with dependent values and arithmetic constraints, bringing real-time abstractions into traditional linear temporal logics. For example, the safety property "The event Done will be triggered no later than one time unit"<sup>1</sup> is expressed in TimEffs as: Φ ≜ 0≤t<1 ∧ (_⋆ · Done)#t. Here ∧ connects the arithmetic formula and the timed trace; the operator # binds time variables to traces (here t is a time bound of (_⋆ · Done)); the wildcard _ matches any event; and the Kleene star ⋆ denotes trace repetition. The formula Φ corresponds to ♦<sub>[0,1)</sub>Done in metric temporal logic (MTL), read as "within one time unit, Done finally happens". Furthermore, time bounds can depend on program inputs, as shown in Fig. 1.

```
 1 void addOneSugar ()
 2 /* req: true ∧ _⋆
 3    ens: t>1 ∧ ε # t */
 4 { timeout (() , 1) ; }
 5
 6 void addNSugar ( int n )
 7 /* req: true ∧ _⋆
 8    ens: t≥n ∧ EndSugar # t */
 9 { if ( n == 0) {
10     event [" EndSugar "]; }
11   else {
12     addOneSugar () ;
13     addNSugar (n -1) ; }}
```
Fig. 1. Value-dependent specification.

Function addNSugar takes a parameter n, representing the number of portions of sugar to add. When n=0, it raises an event EndSugar to mark the end of the process. Otherwise, it adds one portion of sugar by calling addOneSugar(), then recursively calls addNSugar with parameter n-1. The use of timeout(e, d) is standard [11]: it executes a block of code e after the specified time d. Therefore, the time spent adding one portion of sugar is more than one time unit. Note that ε#t refers to an empty trace which takes time t. Both preconditions impose no arithmetic constraints and no temporal constraints upon the history traces. The postcondition of addNSugar(n) indicates that the method generates a finite trace in which EndSugar takes no less than n time units to occur.

Although these examples are simple, they show the benefits of deploying value-dependent time bounds, which are beyond the capability of TAs. Essentially, TimEffs define symbolic TAs, which stand for a (possibly infinite) set of concrete transition systems. Moreover, we deploy a Hoare-style forward verifier to soundly reason about behaviors at the source level, with respect to the well-defined operational semantics. This approach provides direct verification (as opposed to techniques that require a manual and separate modeling process) and modular verification – where modules can be replaced by their already verified properties – for real-time systems, which is not possible with any existing techniques. Furthermore, we develop a novel TRS, inspired by Antimirov and Mosses' algorithm<sup>2</sup> [12], that solves language inclusion between the more expressive TimEffs. In short, the main contributions of this work are:

<sup>1</sup> In this paper, we assume time is discrete, taking only integral values. However, continuous time can be represented by letting time variables range over real values [10].

1. Language Abstraction: we formally define a core language C<sup>t</sup>, giving its syntax and operational semantics, generalizing real-time systems with mutable variables and timed behavioral patterns, e.g., delay, timeout, deadline.
2. Novel Specification: we propose TimEffs, defining its syntax and semantics, gaining expressive power beyond traditional linear temporal logics.
3. Forward Verifier: we establish a sound effect system to reason about the temporal behaviors of given programs. The verifier triggers the back-end solver TRS.
4. Efficient TRS: we present rewriting rules to (dis)prove inclusion relations between the actual behaviors and the given specifications, both in TimEffs.
5. Implementation and Evaluation: we prototype the automated verification system, prove its soundness, and report on case studies and experimental results.

## 2 Overview

An overview of our automated verification system is given in Fig. 2. The system consists of a forward verifier and a TRS, i.e., the rounded boxes. The input of the forward verifier is a C<sup>t</sup> program annotated with temporal specifications written in TimEffs. The input of the TRS is a pair of effects LHS and RHS, referring to the inclusion LHS ⊑ RHS<sup>3</sup> to be checked (LHS and RHS refer to the left- and right-hand-side effects, respectively). The forward verifier calls the TRS to solve proof obligations. Next, we use Fig. 3 to highlight our main methodologies on a program that simulates a coffee machine, which dynamically adds sugar based on the user's input number.

Fig. 2. System Overview.

2.1. TimEffs. We define Hoare-triple style specifications (enclosed in /\*...\*/) for each function, which leads to a compositional verification strategy where static checking can be done locally. The precondition of makeCoffee specifies that the input value n is non-negative, and it requires that, before entering this function, the history trace contains the event CupReady at its tail. The verification fails if the precondition is not satisfied at the call sites. Line 17 sets a five-time-unit deadline (i.e., a maximum of 5 portions of sugar per coffee) while calling addNSugar (defined in Fig. 1); then it emits the event Coffee with a deadline, indicating that the coffee-pouring process takes no more than four time units. The precondition of main requires no arithmetic constraints (expressed as true) and an empty history trace. The postcondition of main specifies that before the final Done happens there is no occurrence of Done (! indicates the absence of events), and that the whole process takes no more than nine time units to reach the final event.

<sup>2</sup> Antimirov and Mosses' algorithm was designed for deciding inequalities of regular expressions, based on an axiomatic algorithm for the algebra of regular sets.

<sup>3</sup> The TimEffs inclusion relation is formally defined in Definition 3.

```
14 void makeCoffee ( int n )
15 /* req: n≥0 ∧ _⋆ · CupReady
16    ens: n≤t≤5 ∧ t'≤4 ∧ (EndSugar # t) · (Coffee # t') */
17 { deadline ( addNSugar ( n ) , 5) ;
18   deadline ( event [" Coffee "] , 4) ; }
19
20 int main ()
21 /* req: true ∧ ε
22    ens: t≤9 ∧ ((!Done)⋆ # t) · Done */
23 { event [" CupReady "];
24   makeCoffee (3) ;
25   event [" Done "]; }
```
TimEffs support further features such as disjunctions, guards, parallelism, and assertions (cf. Sec. 3.3), providing detailed information on: branching properties, where different arithmetic conditions on the inputs lead to different effects; and required history traces, defined by the prior effects in the precondition. These capabilities are beyond traditional timed verification and cannot be fully captured by any prior works [8,9,2,3,4,5]. Nevertheless, the increase in expressive power needs support from finer-grained reasoning and a more sophisticated back-end solver, discharged by our forward verifier and TRS.

Fig. 3. To make coffee with three portions of sugar within nine time units.

```
1.  void addOneSugar(){           // initialize the state using the function precondition
    Φ_C = Φ_pre^{addOneSugar} = {true ∧ _⋆}                                   [FV-Meth]
2.    timeout((), 1); }
    Φ'_C = {t1>1 ∧ _⋆ · (ε#t1)}                                               [FV-Timeout]
3.  Φ'_C ⊑ Φ_pre^{addOneSugar} · Φ_post^{addOneSugar}
      ⇔  t1>1 ∧ _⋆ · (ε#t1)  ⊑  t>1 ∧ _⋆ · (ε#t)
4.  void addNSugar (int n){      // initialize the state using the function precondition
    Φ_C = Φ_pre^{addNSugar(n)} = {true ∧ _⋆}                                  [FV-Meth]
5.    if (n == 0){
    {n=0 ∧ _⋆}                                                                [FV-Cond]
6.      event ["EndSugar"];}
    {n=0 ∧ _⋆ · EndSugar}                                                     [FV-Event]
7.    else {
    {n≠0 ∧ _⋆}                                                                [FV-Cond]
8.      addOneSugar();
    {n≠0 ∧ t2>1 ∧ _⋆ · (ε#t2)}                                                [FV-Call]
9.      addNSugar (n-1);}}
    n≠0 ∧ t2>1 ∧ _⋆ · (ε#t2) ⊑ Φ_pre^{addNSugar(n-1)}    // TRS: precondition checked
    {n≠0 ∧ t2>1 ∧ _⋆ · (ε#t2) · Φ_post^{addNSugar(n-1)}}                      [FV-Call]
10. Φ'_C = (n=0 ∧ _⋆ · EndSugar) ∨ (n≠0 ∧ t2>1 ∧ _⋆ · (ε#t2) · Φ_post^{addNSugar(n-1)})   [FV-Cond]
11. Φ'_C ⊑ Φ_pre^{addNSugar(n)} · Φ_post^{addNSugar(n)}   // TRS: postcondition checked, cf. Table 1
      ⇔  (n=0 ∧ EndSugar) ∨ (n≠0 ∧ t2>1 ∧ (ε#t2) · Φ_post^{addNSugar(n-1)})  ⊑  Φ_post^{addNSugar(n)}
```

Fig. 4. The forward verification examples (t1 and t2 are fresh time variables).

2.2. Forward Verification. Fig. 4 demonstrates the forward verification of the functions addOneSugar and addNSugar defined in Fig. 1. The effects states are captured in the form {Φ<sub>C</sub>}. To facilitate the illustration, we label the steps (1) to (11) and mark the deployed forward rules (cf. Sec. 4.1) in [gray]. The initial states (1) and (4) are obtained from the preconditions by the [FV-Meth] rule. States (5), (7), and (10) are obtained by [FV-Cond], which enforces the conditional constraints in the effects states and unions the effects accumulated from the two branches. State (6) is obtained by [FV-Event], which concatenates an event to the current effects. The intermediate states (8) and (9) are obtained by [FV-Call]. Before each function call, [FV-Call] invokes the TRS to check whether the current effects states satisfy the callee's precondition. If not, the verification fails; otherwise, it concatenates the callee's postcondition to the current states (the precondition check for step (8) is omitted here).

State (2) is obtained by [FV-Timeout], which adds a lower time bound to an empty trace. After these state transformations, steps (3) and (11) invoke the TRS to check the inclusions between the final effects and the declared postconditions.

2.3. The TRS. With TimEffs as the specification language and the forward verifier reasoning about the actual behaviors, we are interested in the following verification problem: given a program P and a temporal specification Φ′, does the inclusion Φ<sub>P</sub> ⊑ Φ′ hold? Typically, checking the inclusion/entailment between the concrete program effects Φ<sub>P</sub> and the expected property Φ′ proves that the program P will never lead to unsafe traces which violate Φ′.

Our TRS is an extension of Antimirov and Mosses' algorithm [12], which can be deployed to decide inclusions between two regular expressions (REs) through an iterated process of checking inclusions of their partial derivatives [13]. There are two basic rules: [Disprove] infers false from trivially inconsistent inclusions; and [Unfold] applies Definition 2 to generate new inclusions.

Definition 1 (Derivative). Given any formal language S over an alphabet Σ and any string u ∈ Σ<sup>∗</sup>, the derivative of S with respect to u is defined as u<sup>-1</sup>S = {w ∈ Σ<sup>∗</sup> | uw ∈ S}.

Definition 2 (REs Inclusion). For REs r and s, r ⊑ s ⇔ ∀(A ∈ Σ). A<sup>-1</sup>(r) ⊑ A<sup>-1</sup>(s).

Definition 3 (TimEffs Inclusion). For TimEffs Φ<sub>1</sub> and Φ<sub>2</sub>, Φ<sub>1</sub> ⊑ Φ<sub>2</sub> ⇔ ∀A. ∀t≥0. (A#t)<sup>-1</sup>Φ<sub>1</sub> ⊑ (A#t)<sup>-1</sup>Φ<sub>2</sub>.

Similarly, Definition 3 unfolds inclusions between TimEffs, where (A#t)<sup>-1</sup>Φ is the partial derivative of Φ w.r.t. the event A with time bound t. Termination of the rewriting is guaranteed because the set of derivatives to be considered is finite, and possible cycles are detected using memoization (cf. Table 5) [14]. Next, we use Table 1 to demonstrate how the TRS automatically proves that the final effects of main satisfy the postcondition (shown at step (11) in Fig. 4). We mark the rewriting rules (cf. Sec. 5) in [gray].
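To make the [Disprove]/[Unfold] loop concrete, the following is a minimal Python sketch of a derivative-based inclusion check for plain REs (not full TimEffs). It uses Brzozowski-style derivatives with light simplification rather than Antimirov's partial-derivative sets, so it illustrates the idea only and is not the paper's TRS; all names are ours.

```python
# REs as hashable tuples: ("0",) = empty language, ("1",) = empty string,
# ("sym", a), ("cat", r, s), ("alt", r, s), ("star", r).
EMPTY, EPS = ("0",), ("1",)

def sym(a): return ("sym", a)

def cat(r, s):          # smart constructors keep derivatives small
    if r == EMPTY or s == EMPTY: return EMPTY
    if r == EPS: return s
    if s == EPS: return r
    return ("cat", r, s)

def alt(r, s):
    if r == EMPTY: return s
    if s == EMPTY: return r
    if r == s: return r
    return ("alt", r, s)

def star(r):
    if r in (EMPTY, EPS): return EPS
    if r[0] == "star": return r
    return ("star", r)

def nullable(r):
    tag = r[0]
    if tag == "1": return True
    if tag in ("0", "sym"): return False
    if tag == "cat": return nullable(r[1]) and nullable(r[2])
    if tag == "alt": return nullable(r[1]) or nullable(r[2])
    return True  # star

def deriv(r, a):        # Brzozowski derivative a^{-1}(r)
    tag = r[0]
    if tag in ("0", "1"): return EMPTY
    if tag == "sym": return EPS if r[1] == a else EMPTY
    if tag == "cat":
        d = cat(deriv(r[1], a), r[2])
        return alt(d, deriv(r[2], a)) if nullable(r[1]) else d
    if tag == "alt": return alt(deriv(r[1], a), deriv(r[2], a))
    return cat(deriv(r[1], a), r)  # star: a^{-1}(r*) = a^{-1}(r) · r*

def included(r, s, alphabet, seen=None):
    """Check r ⊑ s by iterated derivatives with cycle detection."""
    if seen is None: seen = set()
    if (r, s) in seen: return True          # cycle closed: coinductive success
    seen.add((r, s))
    if nullable(r) and not nullable(s):     # [Disprove]
        return False
    return all(included(deriv(r, a), deriv(s, a), alphabet, seen)  # [Unfold]
               for a in alphabet)

# a·b is included in (a|b)*, but a* is not included in a
print(included(cat(sym("a"), sym("b")), star(alt(sym("a"), sym("b"))), "ab"))  # True
print(included(star(sym("a")), sym("a"), "a"))                                  # False
```

The memoized pair set plays the role of the cycle detection mentioned above: revisiting an inclusion pair closes the proof rather than looping forever.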

In Table 1, step ○1 renames the time variables to avoid name clashes between the antecedent and the consequent. Step ○2 splits the proof tree into two branches, according to the different arithmetic constraints, by rule [LHS-OR]. In the first branch, step ○3 eliminates the event ES from the head of both sides, by rule [UNFOLD]. Step ○4 proves the inclusion, because the consequent tR≥0 ∧ ε#tR evidently contains ε when tR=0. In the second branch, step ○5 eliminates a

Table 1. An inclusion proving example ((I) is the right-hand-side sub-tree of the main rewriting proof tree; ES stands for the event EndSugar).

```
(main proof tree, read bottom-up)
○4 [PROVE]   n=0 ∧ ε  ⊑  tR≥0 ∧ ε#tR
○3 [UNFOLD]  n=0 ∧ ES  ⊑  tR≥0 ∧ ES#tR                 (ES eliminated)          (I)
○2 [LHS-OR]  (n=0∧ES) ∨ (n≠0∧t2>1∧tL≥(n-1) ∧ (ε#t2)·(ES#tL))  ⊑  tR≥n ∧ ES#tR
○1 [RENAME]  (n=0∧ES) ∨ (n≠0∧t2>1 ∧ (ε#t2)·Φ_post^{addNSugar(n-1)})  ⊑  Φ_post^{addNSugar(n)}

(sub-tree (I), read bottom-up)
○7 [PROVE]   t2>1 ∧ tL≥(n-1) ∧ tL=(tR-t2)  ⇒  tR≥n
○6 [UNFOLD]  n≠0∧t2>1∧tL≥(n-1) ∧ ES#tL  ⊑  tR≥n ∧ ES#(tR-t2)       (πu: tL=(tR-t2))
○5 [UNFOLD]  n≠0∧t2>1∧tL≥(n-1) ∧ (ε#t2)·(ES#tL)  ⊑  tR≥n ∧ ES#tR   (ε#t2 eliminated)
```
time duration ε#t2 from both sides; the rule [UNFOLD] accordingly subtracts this duration from the consequent's bound, i.e., it becomes (tR-t2). Similarly, step ○6 eliminates ES#tL from both sides, adding tL=(tR-t2) to the unification constraints. Step ○7 proves t2>1 ∧ tL≥(n-1) ∧ tL=(tR-t2) ⇒ tR≥n<sup>4</sup>; therefore, the proof succeeds.

2.4. Verifying Fischer's Mutual Exclusion Protocol. Fig. 5 presents the classical Fischer's mutual exclusion protocol in C<sup>t</sup>. Global variables x and cs indicate 'which process attempted to access the critical section most recently' and 'the number of processes accessing the critical section', respectively. The main procedure is a parallel composition of three processes, where d and e are two constants. Each process attempts to enter the critical section when x is -1, i.e., when no other process is currently attempting. Once a process is active (i.e., reaches line 6), it sets x to its identity number i within d time units, captured by deadline(...,d). Then it idles for e time units, captured by delay(e), and checks whether x still equals i. If so, it safely enters the critical section; otherwise, it restarts from the beginning. The quantitative timing constraint d<e plays an important role in this algorithm to guarantee mutual exclusion. One way to prove mutual exclusion is to show that cs≤1 always holds. Alternatively, using event temporal logic, we can show that an occurrence of Critical always indicates that the next event is Exit. We show in Sec. 6 that our prototype system can verify such algorithms symbolically.

Fig. 5. Fischer's mutual exclusion algorithm.
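To illustrate the property being checked, here is a small, self-contained Python sketch (ours, not part of the paper's toolchain) that explores a discrete-time model of two Fischer processes by breadth-first search and checks cs≤1. The per-process timers, the write deadline d, the delay e, and the urgency rule for ticks are our simplifying modeling assumptions, following the informal description above.

```python
from collections import deque

IDLE, TRY, WAIT, CS = range(4)   # per-process control locations

def successors(state, d, e):
    """One-step successors of (x, cs, processes) in a discrete-time model."""
    x, cs, procs = state
    out = []
    for i, (pc, t) in enumerate(procs):
        pid = i + 1
        def upd(pc2, t2, x2=x, cs2=cs, i=i):
            ps = list(procs); ps[i] = (pc2, t2)
            return (x2, cs2, tuple(ps))
        if pc == IDLE and x == -1:
            out.append(upd(TRY, d))               # read x == -1; must write within d
        elif pc == TRY:
            out.append(upd(WAIT, e, x2=pid))      # write x := i; then delay e
        elif pc == WAIT and t == 0:
            if x == pid:
                out.append(upd(CS, 0, cs2=cs + 1))   # check passed: enter CS
            else:
                out.append(upd(IDLE, 0))             # check failed: restart
        elif pc == CS:
            out.append(upd(IDLE, 0, x2=-1, cs2=cs - 1))  # exit and release x
    # one time unit may elapse, unless a pending write has reached its deadline
    if all(not (pc == TRY and t == 0) for pc, t in procs):
        ticked = tuple((pc, max(t - 1, 0)) if pc in (TRY, WAIT) else (pc, t)
                       for pc, t in procs)
        out.append((x, cs, ticked))
    return out

def mutual_exclusion(d, e, n=2):
    """BFS over all reachable states; True iff cs <= 1 everywhere."""
    init = (-1, 0, tuple((IDLE, 0) for _ in range(n)))
    seen, frontier = {init}, deque([init])
    while frontier:
        s = frontier.popleft()
        if s[1] > 1:
            return False
        for s2 in successors(s, d, e):
            if s2 not in seen:
                seen.add(s2)
                frontier.append(s2)
    return True

print(mutual_exclusion(d=1, e=2))  # True:  d < e guarantees mutual exclusion
print(mutual_exclusion(d=2, e=1))  # False: d >= e admits a violating interleaving
```

This explicit-state search handles only fixed constants d and e; the point of the paper's symbolic approach is precisely to verify the constraint d<e once, for all instantiations.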

<sup>4</sup> The proof obligations for arithmetic constraints are discharged by the Z3 solver [15].

## 3 Language and Specifications

#### 3.1 The Target Language

We define the core language C<sup>t</sup> in Fig. 6; it is built on C syntax and provides support for timed behavioral patterns.


Fig. 6. A core first-order imperative language with timed constructs via implicit clocks.

Here, c and b stand for integer and Boolean constants; mn and x are metavariables drawn from var (the countably infinite set of arbitrary distinct identifiers). A program P comprises a list of global variable initializations α<sup>∗</sup> and a list of method declarations meth<sup>∗</sup>. We use the <sup>∗</sup> superscript to denote a finite list of items; for example, x<sup>∗</sup> refers to a list of variables x<sub>1</sub>, ..., x<sub>n</sub>. Each method meth has a name mn, an expression-oriented body e, and is associated with a precondition Φ<sub>pre</sub> and a postcondition Φ<sub>post</sub> (the specification syntax is given in Fig. 7). C<sup>t</sup> allows each iterative loop to be optimized into an equivalent tail-recursive method, where mutation of parameters is made visible to the caller.

Expressions comprise: values v; guarded processes [v]e, which behave as e if v is true and otherwise idle until v becomes true; method calls mn(v<sup>∗</sup>); sequential composition e<sub>1</sub>; e<sub>2</sub>; parallel composition e<sub>1</sub>||e<sub>2</sub>, where e<sub>1</sub> and e<sub>2</sub> may communicate via shared variables; conditionals if v e<sub>1</sub> e<sub>2</sub>; and event-raising expressions event[A(v, α<sup>∗</sup>)], where the event A comes from the finite set of event labels Σ. Without loss of generality, events can be further parametrized with a value v and a set of assignments α<sup>∗</sup> that update the mutable variables. Moreover, a number of timed constructs can be used to capture common real-time system behaviors; they are explained via the operational semantics rules in Sec. 3.2.

#### 3.2 Operational Semantics of C<sup>t</sup>

To build the semantics of the system model, we define the notion of a configuration in Definition 4, to capture the global system state during system execution.

Definition 4 (System configuration). A system configuration ζ is a pair (S, e) where S is a variable valuation function (or a stack) and e is an expression.

A transition of the system is of the form ζ ⟶<sup>l</sup> ζ′, where ζ and ζ′ are the system configurations before and after the transition, respectively. Transition labels l include: d, denoting a non-negative integer (a time elapse); τ, denoting an invisible event; and A, denoting an observable event. For example, ζ ⟶<sup>d</sup> ζ′ denotes a d-time-unit elapse. Next, we present the firing rules associated with the timed constructs.

Process delay[v] idles for exactly v time units. Rule [delay<sub>1</sub>] states that the process may idle for any amount of time d less than or equal to v; rule [delay<sub>2</sub>] states that the process terminates immediately once v reaches 0.

$$\frac{d \le v}{(\mathcal{S},\ \mathsf{delay}[v]) \xrightarrow{d} (\mathcal{S},\ \mathsf{delay}[v-d])}\ [delay_1] \qquad \frac{}{(\mathcal{S},\ \mathsf{delay}[0]) \xrightarrow{\tau} (\mathcal{S},\ ())}\ [delay_2]$$

In e<sub>1</sub> timeout[v] e<sub>2</sub>, the first observable event of e<sub>1</sub> shall occur within v time units; otherwise, e<sub>2</sub> takes over control after exactly v time units. Note that the usage of timeout in Fig. 1 is a special case where e<sub>1</sub> never starts by default.

$$\frac{(\mathcal{S}, e_1) \xrightarrow{A} (\mathcal{S}', e_1')}{(\mathcal{S},\ e_1\ \mathsf{timeout}[v]\ e_2) \xrightarrow{A} (\mathcal{S}', e_1')}\ [to_1] \qquad \frac{(\mathcal{S}, e_1) \xrightarrow{\tau} (\mathcal{S}', e_1')}{(\mathcal{S},\ e_1\ \mathsf{timeout}[v]\ e_2) \xrightarrow{\tau} (\mathcal{S}',\ e_1'\ \mathsf{timeout}[v]\ e_2)}\ [to_2]$$

$$\frac{(\mathcal{S}, e_1) \xrightarrow{d} (\mathcal{S}, e_1') \quad (d \le v)}{(\mathcal{S},\ e_1\ \mathsf{timeout}[v]\ e_2) \xrightarrow{d} (\mathcal{S},\ e_1'\ \mathsf{timeout}[v-d]\ e_2)}\ [to_3] \qquad \frac{}{(\mathcal{S},\ e_1\ \mathsf{timeout}[0]\ e_2) \xrightarrow{\tau} (\mathcal{S}, e_2)}\ [to_4]$$

Process deadline[v] e behaves exactly as e, except that it must terminate within v time units. The guarded process [v]e behaves as e when v is true; otherwise it idles until v becomes true. Process e<sub>1</sub> interrupt[v] e<sub>2</sub> behaves as e<sub>1</sub> until v time units have elapsed, after which e<sub>2</sub> takes over. We leave the remaining rules to [16].

$$\frac{(\mathcal{S}, e) \xrightarrow{A/\tau} (\mathcal{S}', e')}{(\mathcal{S},\ \mathsf{deadline}[v]\ e) \xrightarrow{A/\tau} (\mathcal{S}',\ \mathsf{deadline}[v]\ e')}\ [dl_1] \qquad \frac{(\mathcal{S}, e) \xrightarrow{d} (\mathcal{S}, e') \quad (d \le v)}{(\mathcal{S},\ \mathsf{deadline}[v]\ e) \xrightarrow{d} (\mathcal{S},\ \mathsf{deadline}[v-d]\ e')}\ [dl_2]$$

$$\frac{\mathcal{S} \vdash v}{(\mathcal{S},\ [v]e) \xrightarrow{\tau} (\mathcal{S},\ e)}\ [gd_1] \qquad \frac{\mathcal{S} \not\vdash v}{(\mathcal{S},\ [v]e) \xrightarrow{d} (\mathcal{S},\ [v]e)}\ [gd_2]$$

$$\frac{(\mathcal{S}, e_1) \xrightarrow{A/\tau} (\mathcal{S}', e_1')}{(\mathcal{S},\ e_1\ \mathsf{interrupt}[v]\ e_2) \xrightarrow{A/\tau} (\mathcal{S}',\ e_1'\ \mathsf{interrupt}[v]\ e_2)}\ [int_1] \qquad \frac{(\mathcal{S}, e_1) \xrightarrow{d} (\mathcal{S}, e_1') \quad (d \le v)}{(\mathcal{S},\ e_1\ \mathsf{interrupt}[v]\ e_2) \xrightarrow{d} (\mathcal{S},\ e_1'\ \mathsf{interrupt}[v-d]\ e_2)}\ [int_2] \qquad \frac{}{(\mathcal{S},\ e_1\ \mathsf{interrupt}[0]\ e_2) \xrightarrow{\tau} (\mathcal{S},\ e_2)}\ [int_3]$$

#### 3.3 The Specification Language

We embed TimEffs specifications into the Hoare-style verification system, using Φ<sub>pre</sub> and Φ<sub>post</sub> to capture the temporal pre/postconditions. As shown in Fig. 7, TimEffs are constructed from conditioned event sequences π ∧ θ and effects disjunctions Φ<sub>1</sub> ∨ Φ<sub>2</sub>. Timed sequences comprise: nil (⊥); the empty trace ε; a single event ev; concatenation θ<sub>1</sub> · θ<sub>2</sub>; disjunction θ<sub>1</sub> ∨ θ<sub>2</sub>; parallel composition θ<sub>1</sub>||θ<sub>2</sub>; and a block π?θ that waits until the constraint π is satisfied. We introduce a new operator #: θ#t represents the trace θ taking t time units to complete, where t is a real-time term.

(Timed Effects) Φ ::= π ∧ θ | Φ<sub>1</sub> ∨ Φ<sub>2</sub>
(Event Sequences) θ ::= ⊥ | ε | ev | θ<sub>1</sub> · θ<sub>2</sub> | θ<sub>1</sub> ∨ θ<sub>2</sub> | θ<sub>1</sub>||θ<sub>2</sub> | π?θ | θ#t | θ<sup>⋆</sup>
(Events) ev ::= A(v, α<sup>∗</sup>) | τ(π) | !A | _
(Pure) π ::= True | False | bop(t<sub>1</sub>, t<sub>2</sub>) | π<sub>1</sub> ∧ π<sub>2</sub> | π<sub>1</sub> ∨ π<sub>2</sub> | ¬π | π<sub>1</sub> ⇒ π<sub>2</sub>
(Real-Time Terms) t ::= c | x | t<sub>1</sub>+t<sub>2</sub> | t<sub>1</sub>-t<sub>2</sub> (c ∈ Z, x ∈ var)
(# : real-time bound; ⋆ : Kleene star)

Fig. 7. Syntax of TimEffs.

A timed sequence can also be constructed as θ<sup>⋆</sup>, representing zero or more repetitions of the trace θ. Among single events, A(v, α<sup>∗</sup>) stands for an observable event with label A, parameterized by v and the assignment operations α<sup>∗</sup>; τ(π) is an invisible event, parameterized with a pure formula π<sup>5</sup>.

Events can also be !A, referring to all events not labeled with A, and the wildcard _, which matches any event. We use π to denote a pure formula capturing the (Presburger) arithmetic conditions on terms or program parameters, and bop(t<sub>1</sub>, t<sub>2</sub>) to represent binary atomic formulas over terms (with operators =, >, <, ≥, and ≤). Terms consist of constant integers c, integer variables x, and simple computations over terms, t<sub>1</sub>+t<sub>2</sub> and t<sub>1</sub>-t<sub>2</sub>.
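For readers who prefer code, the θ grammar above can be transcribed directly into a small AST. The sketch below (our own constructor names, not the paper's implementation; time bounds kept symbolic as strings) also shows the standard "nullable" test — can θ denote a trace with zero events? — that derivative-based rewriting relies on.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Bot: pass          # ⊥ : nil

@dataclass(frozen=True)
class Eps: pass          # ε : empty trace

@dataclass(frozen=True)
class Ev:                # single observable event A
    label: str

@dataclass(frozen=True)
class Cat:               # θ1 · θ2
    left: object
    right: object

@dataclass(frozen=True)
class Alt:               # θ1 ∨ θ2
    left: object
    right: object

@dataclass(frozen=True)
class Bound:             # θ # t  (t is a symbolic real-time term)
    seq: object
    t: str

@dataclass(frozen=True)
class Star:              # θ⋆
    seq: object

def nullable(th):
    """Can θ denote a trace containing zero events?"""
    if isinstance(th, (Eps, Star)):
        return True
    if isinstance(th, Cat):
        return nullable(th.left) and nullable(th.right)
    if isinstance(th, Alt):
        return nullable(th.left) or nullable(th.right)
    if isinstance(th, Bound):
        return nullable(th.seq)    # ε#t has a duration but no events
    return False                    # Bot, Ev

# (EndSugar # t) · (Coffee # t'), the trace in makeCoffee's postcondition (Fig. 3)
post = Cat(Bound(Ev("EndSugar"), "t"), Bound(Ev("Coffee"), "t'"))
print(nullable(post))               # False
print(nullable(Star(Ev("Done"))))   # True
```

Because every constructor is a frozen dataclass, terms are hashable and can be memoized, which matters for the cycle detection used by the TRS.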

#### 3.4 Semantic Model of Timed Effects

Let d, S, ϕ ⊨ Φ denote the model relation: a stack S and a concrete execution trace ϕ, taking d time units to complete, satisfy the specification Φ.

d, S, ϕ ⊨ Φ1 ∨ Φ2 iff d, S, ϕ ⊨ Φ1 or d, S, ϕ ⊨ Φ2
d, S, ϕ ⊨ π ∧ ε iff d=0 and ⟦π⟧<sub>S</sub>=True and ϕ=[]
d, S, ϕ ⊨ π ∧ ev iff d=0 and ⟦π⟧<sub>S</sub>=True and ϕ=[ev]
d, S, ϕ ⊨ π ∧ (θ1 · θ2) iff ∃ϕ1, ϕ2. ϕ1++ϕ2=ϕ and ∃d1, d2. d1+d2=d s.t. d1, S, ϕ1 ⊨ π ∧ θ1 and d2, S, ϕ2 ⊨ π ∧ θ2
d, S, ϕ ⊨ π ∧ (θ1 ∨ θ2) iff d, S, ϕ ⊨ π ∧ θ1 or d, S, ϕ ⊨ π ∧ θ2
d, S, ϕ ⊨ π ∧ (ev1·θ1)||(ev2·θ2) iff d, S, ϕ ⊨ π ∧ ev1 · (θ1||(ev2·θ2)) or d, S, ϕ ⊨ π ∧ ev2 · ((ev1·θ1)||θ2)
d, S, ϕ ⊨ π ∧ (ev·θ1)||(ev·θ2) iff d, S, ϕ ⊨ π ∧ ev · (θ1||θ2)
d, S, ϕ ⊨ π ∧ (ε#t1)||(ε#t2) iff d, S, ϕ ⊨ (π∧t1≥t2) ∧ (ε#t2) · (ε#(t1-t2)) or d, S, ϕ ⊨ (π∧t1<t2) ∧ (ε#t1) · (ε#(t2-t1))
d, S, ϕ ⊨ π ∧ π1?θ iff ⟦π1⟧<sub>S</sub>=True and d, S, ϕ ⊨ π ∧ θ, or ⟦π1⟧<sub>S</sub>=False and d, S, ϕ ⊨ π ∧ π1?θ
d, S, ϕ ⊨ π ∧ θ#t iff ⟦π ∧ t≥0⟧<sub>S</sub>=True and ∃θ1, θ2. θ1 · θ2=θ and, for fresh t1, t2, d, S, ϕ ⊨ (π ∧ t1≥0 ∧ t2≥0 ∧ t1+t2=t) ∧ (θ1#t1) · (θ2#t2)
d, S, ϕ ⊨ π ∧ θ<sup>⋆</sup> iff d, S, ϕ ⊨ π ∧ ε or d, S, ϕ ⊨ π ∧ θ · θ<sup>⋆</sup>
d, S, ϕ ⊨ false iff ⟦π⟧<sub>S</sub>=False or ϕ=⊥

Fig. 8. Semantics of TimEffs.

<sup>5</sup> The difference between τ (π) and π? is: τ (π) marks an assertion which leads to false (⊥) if π is not satisfied, whereas π? waits until π is satisfied.

To define the model, var is the set of program variables and val is the set of primitive values; d, S, ϕ are drawn from the following concrete domains: d: N, S: var→val, and ϕ: list of events. As shown in Fig. 8, ++ appends event sequences; [] denotes the empty sequence; [ev] is the singleton sequence containing the event ev; and ⟦π⟧_S=True means that π holds on the stack S. Note that simple events, i.e., those without #, are taken to occur instantaneously.
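As a minimal illustration, the base cases of the model relation can be sketched in Python. This is a hypothetical encoding of our own: the names `models_empty`/`models_event` and the use of Python predicates for pure formulae π are illustrative choices, not part of the paper's OCaml implementation.

```python
# Hypothetical sketch of the concrete semantic domains: d is a natural
# number, S maps variable names to primitive values, and phi is a list
# of events. Only the two base cases of |= are modelled here.

def holds(pi, S):
    """Evaluate a pure formula pi (here: a Python predicate) on stack S."""
    return pi(S)

def models_empty(d, S, phi, pi):
    """d, S, phi |= pi /\\ epsilon: no time passes, the trace is empty."""
    return d == 0 and holds(pi, S) and phi == []

def models_event(d, S, phi, pi, ev):
    """d, S, phi |= pi /\\ ev: a single, instantaneous event ev."""
    return d == 0 and holds(pi, S) and phi == [ev]

S = {"x": 1}
assert models_empty(0, S, [], lambda s: s["x"] > 0)
assert models_event(0, S, ["Send"], lambda s: True, "Send")
assert not models_event(3, S, ["Send"], lambda s: True, "Send")  # events are instant
```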

3.5 Expressiveness. TimEffs draws similarities to metric temporal logic (MTL), which extends LTL by adding a set of non-negative real numbers to the temporal modal operators. As shown in Table 2, we are able to encode MTL operators into TimEffs, making them more intuitive and readable. The basic modal operators are: □ for "globally"; ♦ for "finally"; ○ for "next"; U for "until"; and their past-time counterparts: ←□; ←♦; ⊖ for "previous"; and S for "since". In MTL, the time interval I has concrete upper/lower bounds, whereas in TimEffs the bounds can be symbolic and dependent on program inputs.

Table 2. Examples for converting MTL formulae into TimEffs with t∈I applied.


## 4 Automated Forward Verification

#### 4.1 Forward Rules

Forward rules are Hoare-style triples S ⊢ {Π, Θ} e {Π′, Θ′}, where S is the stack environment, and {Π, Θ} and {Π′, Θ′} are program states, i.e., disjunctions of conditioned event sequences π ∧ θ. The meaning of the transition is: {Π′, Θ′} = ⋃_{i=0}^{|{Π,Θ}|-1} {Π′ᵢ, Θ′ᵢ}, where (πᵢ∧θᵢ) ∈ {Π, Θ} and S ⊢ {πᵢ, θᵢ} e {Π′ᵢ, Θ′ᵢ}⁶.

We present here the rules for the time-related constructs and leave the remaining rules to [16]. Rule [FV-Delay] creates a trace #t, where t is fresh, and concatenates it to the current program state, together with the additional constraint t=v. Rule [FV-Deadline] computes the effects of e and adds an upper time bound to the results. Rule [FV-Timeout] computes the effects of e₁ and e₂ starting from the state {π, ε}. The final state is a union of the possible effects with the corresponding time bounds and arithmetic constraints. Note that hd(Θ₁) and tl(Θ₁) return the head event (cf. Definition 6) and the tail of Θ₁, respectively.

[FV-Delay]
    θ′ = θ · (#t)    (t is fresh)
    ────────────────────────────────────────────
    S ⊢ {π, θ} delay[v] {π∧(t=v), θ′}

[FV-Deadline]
    S ⊢ {π, ε} e {Π₁, Θ₁}    (t is fresh)
    ────────────────────────────────────────────
    S ⊢ {π, θ} deadline[v] e {Π₁∧(t≤v), θ · (Θ₁#t)}

[FV-Timeout]
    S ⊢ {π, ε} e₁ {Π₁, Θ₁}    S ⊢ {π, ε} e₂ {Π₂, Θ₂}    (t₁, t₂ are fresh)
    {Π_f, Θ_f} = {Π₁∧t₁<v, (hd(Θ₁)#t₁) · tl(Θ₁)} ∪ {Π₂∧t₂=v, (#t₂) · Θ₂}
    ────────────────────────────────────────────
    S ⊢ {π, θ} e₁ timeout[v] e₂ {Π_f, θ · Θ_f}

[FV-Interrupt]
    S ⊢ {π, ε} e₁ {Π, Θ}    Δ = ⋃_{i=0}^{|{Π,Θ}|-1} ℵ^{Interrupt(v,πᵢ)}_{Interleave}(θᵢ, ε)    S ⊢ {Δ} e₂ {Π′, Θ′}
    ────────────────────────────────────────────
    S ⊢ {π, θ} e₁ interrupt[v] e₂ {Π′, θ · Θ′}
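To illustrate, rule [FV-Delay] can be read as a state transformer. The following is a hypothetical Python sketch of our own: the function name `fv_delay`, the string-based constraint encoding, and the list-based trace are illustrative choices, not the paper's OCaml implementation.

```python
# Hypothetical sketch of rule [FV-Delay] as a state transformer: a
# program state is a pair (pi, theta) of a constraint string and a
# trace; delay[v] appends a fresh symbolic duration #t and records t=v.

import itertools

_fresh = itertools.count()

def fresh_time():
    """Generate a fresh symbolic time variable name."""
    return f"t{next(_fresh)}"

def fv_delay(state, v):
    pi, theta = state
    t = fresh_time()                      # t is fresh
    pi2 = f"{pi} /\\ ({t}={v})"           # add the constraint t = v
    theta2 = theta + [f"#{t}"]            # concatenate the timed trace #t
    return (pi2, theta2)

pi, theta = fv_delay(("True", ["A"]), 5)
print(pi)      # e.g. "True /\ (t0=5)"
print(theta)   # e.g. ["A", "#t0"]
```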

⁶ |{Π, Θ}| is the size of {Π, Θ}, i.e., the number of conditioned event sequences π∧θ.

[FV-Interrupt] computes the interruption interleavings of e₁'s effects, which over-approximate all the possibilities. For example, for the trace A · B, an interruption at time t creates three possibilities: (#t) ∨ (A#t) ∨ ((A · B)#t). The rule then computes the effects of e₂ and, lastly, prepends the original history θ to the final results. Algorithm 1 presents the interleaving algorithm for interruptions, where + unions program states (cf. Definition 7 and Definition 8 for the fst and D functions).

Algorithm 1: Interruption Interleaving

Input: v, π, θ, θ_his
Output: program states Δ

1  function ℵ^{Interrupt(v,π)}_{Interleave}(θ, θ_his)
2      Δ ← []
3      foreach f ∈ fst_π(θ) do
4          φ ← π∧(t<v) ∧ (θ_his#t)
5          θ′ ← D^π_f(θ)
6          θ′_his ← θ_his · f
7          Δ′ ← ℵ^{Interrupt(v,π)}_{Interleave}(θ′, θ′_his)
8          Δ ← Δ + φ + Δ′
9      return Δ
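For a purely sequential trace, the recursion over fst and D in Algorithm 1 amounts to enumerating prefixes. The following hypothetical Python sketch reproduces the three possibilities for the trace A · B; the constraint t<v and the pure formula π are elided.

```python
# For a concrete sequential trace, Algorithm 1 degenerates to prefix
# enumeration: interrupting A·B at time t yields (#t), (A#t), ((A·B)#t).
# A hypothetical simplification; real effects also carry t < v.

def interrupt_possibilities(trace):
    """Return every prefix of `trace`, each time-bounded by #t."""
    out = []
    for i in range(len(trace) + 1):
        prefix = "·".join(trace[:i]) or "ε"   # empty prefix is epsilon
        out.append(f"({prefix})#t")
    return out

print(interrupt_possibilities(["A", "B"]))
# ['(ε)#t', '(A)#t', '(A·B)#t']
```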

Theorem 1 (Soundness of Forward Rules). Given any system configuration ζ=(S, e), by applying the operational semantics rules, if (S, e) →* (S′, v) has execution time d and produces the event sequence ϕ; and for any history effect π∧θ such that d₁, S, ϕ₁ |= (π∧θ), if the forward verifier derives S ⊢ {π, θ} e {Π, Θ}, then ∃(π′∧θ′) ∈ {Π, Θ} such that (d₁+d), S′, (ϕ₁++ϕ) |= (π′∧θ′). (ζ →* ζ′ denotes the reflexive, transitive closure of ζ → ζ′.)

Proof. See the technical report [16].

## 5 Temporal Verification via a TRS

The TRS is an automated entailment checker that proves language inclusions between TimEffs. It is triggered prior to function calls for precondition checking, and at the end of verifying a function for postcondition checking.

Given two effects Φ₁ and Φ₂, the TRS decides whether the inclusion Φ₁ ⊑ Φ₂ is valid. During the effects rewriting process, inclusions take the form Γ ⊢ Φ₁ ⊑^Φ Φ₂, a shorthand for Γ ⊢ Φ · Φ₁ ⊑ Φ · Φ₂. To prove such an inclusion is to check whether all the possible timed traces in the antecedent Φ₁ are legitimately allowed by the timed traces described by the consequent Φ₂. Here Γ is the proof context, i.e., a set of effects inclusion hypotheses, and Φ is the history of effects from the antecedent that has already been used to match the consequent. The checking is initially invoked with Γ=∅ and Φ=True ∧ ε.

Effects Disjunctions. An inclusion with a disjunctive antecedent succeeds if both disjuncts entail the consequent. An inclusion with a disjunctive consequent succeeds if the antecedent entails either of the disjuncts.

    Γ ⊢ Φ₁ ⊑ Φ    Γ ⊢ Φ₂ ⊑ Φ
    ───────────────────────── [LHS-OR]
    Γ ⊢ Φ₁ ∨ Φ₂ ⊑ Φ

    Γ ⊢ Φ ⊑ Φ₁  or  Γ ⊢ Φ ⊑ Φ₂
    ─────────────────────────── [RHS-OR]
    Γ ⊢ Φ ⊑ Φ₁ ∨ Φ₂

Now the inclusions are disjunction-free formulas. Next we provide the definitions and key implementations of the auxiliary functions Nullable, First, and Derivative. Intuitively, the Nullable function δ_π(θ) returns a Boolean value indicating whether π∧θ contains the empty trace; the First function fst_π(θ) computes the set of initial heads, denoted h, of π∧θ; and the Derivative function D^π_h(θ) computes the next-state effects after eliminating the head h from the current effects π ∧ θ.

Definition 5 (Nullable⁷). Given any Φ = π ∧ θ, δ_π(θ) : bool = true if ε ∈ ⟦π∧θ⟧, and false if ε ∉ ⟦π∧θ⟧.

$$\begin{aligned} \delta_{\pi}(\bot) &= \delta_{\pi}(ev) = \mathit{false} & \delta_{\pi}(\epsilon) &= \delta_{\pi}(\theta^{\star}) = \mathit{true} & \delta_{\pi}(\pi'?\theta) &= \delta_{\pi}(\theta)\\ \delta_{\pi}(\theta_{1} \vee \theta_{2}) &= \delta_{\pi}(\theta_{1}) \vee \delta_{\pi}(\theta_{2}) & \delta_{\pi}(\theta_{1} \cdot \theta_{2}) &= \delta_{\pi}(\theta_{1}) \wedge \delta_{\pi}(\theta_{2}) & \delta_{\pi}(\theta_{1}\,||\,\theta_{2}) &= \delta_{\pi}(\theta_{1}) \wedge \delta_{\pi}(\theta_{2})\\ \delta_{\pi}(\theta\#t) &= SAT(\pi \wedge (t{=}0)) \wedge \delta_{\pi}(\theta) \end{aligned}$$

Definition 6 (Heads). If h is a head of π ∧ θ, then there exist π′ and θ′ such that π ∧ θ = π′ ∧ (h · θ′). A head can be t, denoting pure time passing; A(v, α*), denoting an instantaneous event; or (A(v, α*), t), denoting an event that takes time t.

Definition 7 (First). Given any Φ = π ∧ θ, fst_π(θ) returns a set of heads, i.e., the set of initial elements derivable from the effects π ∧ θ, where t′ is fresh:

$$\begin{aligned} fst_{\pi}(\bot) &= fst_{\pi}(\epsilon) = \{\} & fst_{\pi}(A(v, \alpha^{*})) &= \{A(v, \alpha^{*})\} & fst_{\pi}(\epsilon\#t) &= \{t\} & fst_{\pi}(\theta^{\star}) &= fst_{\pi}(\theta)\\ fst_{\pi}(\theta\#t) &= \{(A(v, \alpha^{*}), t') \mid A(v, \alpha^{*}) \in fst_{\pi}(\theta)\} & fst_{\pi}(\theta_{1} \vee \theta_{2}) &= fst_{\pi}(\theta_{1}) \cup fst_{\pi}(\theta_{2})\\ fst_{\pi}(\pi'?\theta) &= fst_{\pi}(\theta) & fst_{\pi}(\theta_{1}\,||\,\theta_{2}) &= fst_{\pi}(\theta_{1}) \cup fst_{\pi}(\theta_{2})\\ fst_{\pi}(\theta_{1} \cdot \theta_{2}) &= \begin{cases} fst_{\pi}(\theta_{1}) \cup fst_{\pi}(\theta_{2}) & \text{if } \delta_{\pi}(\theta_{1}) = \mathit{true}\\ fst_{\pi}(\theta_{1}) & \text{otherwise} \end{cases} \end{aligned}$$

Definition 8 (TimEffs Partial Derivative). Given any Φ = π ∧ θ, the partial derivative D^π_h(θ) computes the effects for the left quotient h⁻¹(π ∧ θ), cf. Definition 1.

$$\begin{aligned} D^{\pi}_{h}(\bot) &= D^{\pi}_{h}(\epsilon) = \mathit{False} \wedge \bot & D^{\pi}_{h}(A(v, \alpha^{*})) &= (\pi \wedge (h{=}A(v, \alpha^{*}))) \wedge \epsilon & D^{\pi}_{h}(\theta^{\star}) &= D^{\pi}_{h}(\theta) \cdot \theta^{\star}\\ D^{\pi}_{\tau(\pi_1)}(\pi'?\theta) &= \begin{cases} \pi \wedge \pi'?\theta & \text{if } \pi_1 \not\Rightarrow \pi'\\ \pi \wedge \theta & \text{if } \pi_1 \Rightarrow \pi' \end{cases} & D^{\pi}_{h}(\theta_{1} \cdot \theta_{2}) &= \begin{cases} D^{\pi}_{h}(\theta_{1}) \cdot \theta_{2} \vee D^{\pi}_{h}(\theta_{2}) & \text{if } \delta_{\pi}(\theta_{1}) = \mathit{true}\\ D^{\pi}_{h}(\theta_{1}) \cdot \theta_{2} & \text{if } \delta_{\pi}(\theta_{1}) = \mathit{false} \end{cases}\\ D^{\pi}_{(A(v,\alpha^{*}),\,t)}(\theta) &= \bigvee \{D^{\pi'}_{A(v,\alpha^{*})}(\theta') \mid (\pi' \wedge \theta') \in D^{\pi}_{t}(\theta)\} & D^{\pi}_{t}(\theta\#t') &= (\pi \wedge t{+}t''{=}t') \wedge \theta\#t'' \quad (t'' \text{ is fresh})\\ D^{\pi}_{h}(\theta_{1} \vee \theta_{2}) &= D^{\pi}_{h}(\theta_{1}) \vee D^{\pi}_{h}(\theta_{2}) & D^{\pi}_{A(v,\alpha^{*})}(\theta\#t) &= \bigvee \{(\pi' \wedge (\theta'\#t)) \mid (\pi' \wedge \theta') \in D^{\pi}_{A(v,\alpha^{*})}(\theta)\} & D^{\pi}_{h}(\theta_{1} || \theta_{2}) &= \bar{D}^{\pi}_{h}(\theta_{1}) \,||\, \bar{D}^{\pi}_{h}(\theta_{2}) \end{aligned}$$

Notice that the derivative of a parallel composition makes use of the parallel derivative D̄^π_h(θ), defined as follows:

$$\bar{D}^{\pi}_{h}(\theta) = \begin{cases} \pi \wedge \theta & \text{if } D^{\pi}_{h}(\pi \wedge \theta) = (\mathit{False} \wedge \bot)\\ D^{\pi}_{h}(\theta) & \text{otherwise} \end{cases}$$

5.1 Rewriting Rules. Given the auxiliary functions above, we now discuss the key rewriting rules deployed in effects inclusion proofs.

$$\frac{}{\Gamma \vdash \pi \wedge \bot \sqsubseteq \Phi}\ [\textit{Bot-LHS}] \qquad\qquad \frac{\Phi \neq \pi \wedge \bot}{\Gamma \vdash \Phi \not\sqsubseteq \pi \wedge \bot}\ [\textit{Bot-RHS}]$$

<sup>7</sup> SAT(π) stands for querying the Z3 theorem prover to check the satisfiability of π.

Axiom rules [Bot-LHS] and [Bot-RHS] are analogous to those of standard propositional logic: ⊥ (denoting false) entails any effects, while no non-false effects entail ⊥. [DISPROVE] is used to disprove an inclusion when the antecedent is nullable while the consequent is not.

We use two rules to prove an inclusion: (i) [PROVE] applies when the antecedent has no head; and (ii) [REOCCUR] proves an inclusion when the proof context Γ contains inclusion hypotheses that soundly discharge the current goal. [UNFOLD] is the inductive step, unfolding the inclusion; the proof of the original inclusion succeeds if all the derivative inclusions succeed.

    (π₁∧θ₁ ⊑ π₃∧θ₃) ∈ Γ    (π₃∧θ₃ ⊑ π₄∧θ₄) ∈ Γ    (π₄∧θ₄ ⊑ π₂∧θ₂) ∈ Γ
    ────────────────────────────────────────────────────────────── [REOCCUR]
    Γ ⊢ π₁ ∧ θ₁ ⊑ π₂ ∧ θ₂

    H = fst_{π₁}(θ₁)    Γ′ = Γ, (π₁∧θ₁ ⊑ π₂∧θ₂)    ∀h∈H. (Γ′ ⊢ D^{π₁}_h(θ₁) ⊑ D^{π₂}_h(θ₂))
    ────────────────────────────────────────────────────────────── [UNFOLD]
    Γ ⊢ π₁ ∧ θ₁ ⊑ π₂ ∧ θ₂
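To make the rewriting loop concrete, here is a hypothetical Python sketch for an untimed, constraint-free fragment (events, ·, ∨, ⋆): `included` implements [DISPROVE], [PROVE], [UNFOLD], and the self-reoccurrence case of [REOCCUR], with smart constructors keeping the derivative set finite. The timed operators #t, π?, and ||, and the Z3 side conditions, are omitted; this is our own illustration, not the paper's implementation.

```python
# Hypothetical sketch of the TRS loop on an untimed fragment of TimEffs.
# Effects are tuples; BOT is the empty language, EMP the empty trace.

BOT, EMP = ("bot",), ("emp",)

def ev(a): return ("ev", a)
def star(t): return ("star", t)

def cat(a, b):                       # smart constructor: normalize
    if a == EMP: return b
    if b == EMP: return a
    if a == BOT or b == BOT: return BOT
    return ("cat", a, b)

def alt(a, b):
    if a == BOT or a == b: return b
    if b == BOT: return a
    return ("alt", a, b)

def nullable(t):                     # delta: contains the empty trace?
    k = t[0]
    if k in ("emp", "star"): return True
    if k == "cat": return nullable(t[1]) and nullable(t[2])
    if k == "alt": return nullable(t[1]) or nullable(t[2])
    return False                     # bot, ev

def first(t):                        # fst: set of possible heads
    k = t[0]
    if k == "ev": return {t[1]}
    if k == "star": return first(t[1])
    if k == "alt": return first(t[1]) | first(t[2])
    if k == "cat":
        return first(t[1]) | (first(t[2]) if nullable(t[1]) else set())
    return set()                     # bot, emp

def deriv(h, t):                     # D: eliminate the head h
    k = t[0]
    if k == "ev": return EMP if t[1] == h else BOT
    if k == "star": return cat(deriv(h, t[1]), t)
    if k == "alt": return alt(deriv(h, t[1]), deriv(h, t[2]))
    if k == "cat":
        d = cat(deriv(h, t[1]), t[2])
        return alt(d, deriv(h, t[2])) if nullable(t[1]) else d
    return BOT                       # bot, emp

def included(a, c, gamma=frozenset()):
    if (a, c) in gamma: return True                   # [REOCCUR]
    if nullable(a) and not nullable(c): return False  # [DISPROVE]
    if not first(a): return True                      # [PROVE]
    gamma = gamma | {(a, c)}                          # [UNFOLD]
    return all(included(deriv(h, a), deriv(h, c), gamma)
               for h in first(a))

A, B = ev("A"), ev("B")
assert included(cat(A, B), cat(star(A), B))     # A·B is included in A*·B
assert included(star(A), star(A))               # closed by [REOCCUR]
assert not included(star(A), cat(A, star(A)))   # A* is nullable, A·A* is not
```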

### Theorem 2 (Termination of the TRS). The TRS is terminating.

Proof. See the technical report [16].

Theorem 3 (Soundness of the TRS). Given an inclusion Φ<sup>1</sup> v Φ<sup>2</sup> , if the TRS returns TRUE with a proof, then Φ<sup>1</sup> v Φ<sup>2</sup> is valid.

Proof. See the technical report [16].

## 6 Implementation and Evaluation

To show feasibility, we prototype our automated verification system in OCaml (∼5k LOC) and prove soundness for both the forward verifier and the TRS. We set up two experiments to evaluate our implementation: i) functionality validation by verifying symbolic timed programs; and ii) comparison with PAT [17] and Uppaal [3] on the real-life Fischer's mutual exclusion algorithm. Experiments are done on a MacBook with a 2.6 GHz 6-core Intel i7 processor. The source code and the evaluation benchmarks are openly accessible from [18].

6.1 Experimental Results for Symbolic Timed Models. We manually annotate TimEffs specifications for a set of synthetic examples (about 54 programs) to test the main contributions, including: computing effects from symbolic timed programs written in Cᵗ; and inclusion checking for TimEffs with parallel composition, the block waiting operator, and shared global variables.

Table 3 presents the evaluation results for another 16 Cᵗ programs⁸, where the annotated temporal specifications are in a 1:1 ratio of succeeded/failed cases. The table records: No., the index of the program; LOC, lines of code; Forward(ms), effects computation time; #Prop(✓), the number of valid properties; Avg-Prove(ms), the average proving time for the valid properties; #Prop(✗), the number of invalid properties; Avg-Dis(ms), the average disproving time for the invalid properties; and #AskZ3, the number of Z3 queries throughout the experiments.

<sup>8</sup> All programs contain timed constructs, conditionals, and parallel compositions.


Table 3. Experimental Results for Manually Constructed Synthetic Examples.

Observations: i) the proving/disproving time increases as the effects computation time increases, because a larger Forward(ms) indicates higher complexity of the timed constructs, which complicates the inclusion checking; ii) as the number of Z3 queries per property (#AskZ3/(#Prop(✓)+#Prop(✗))) goes up, the proving/disproving time goes up; and iii) the disproving times for invalid properties are consistently lower than the proving times, regardless of the program's complexity, which is as expected in a TRS.

6.2 Verifying Fischer's Mutual Exclusion Algorithm. As shown in Table 4, the data in columns PAT(s) and Uppaal(s) are drawn from prior work [19] and indicate the time to prove Fischer's mutual exclusion w.r.t. the number of processes (#Proc) in PAT and Uppaal, respectively. For our system, based on the implementation presented in Fig. 5, we are able to prove the mutual exclusion property given the arithmetic constraint d<e. Moreover, the system disproves mutual exclusion when d≥e. We record the proving (Prove(s)) and disproving (Disprove(s)) times and the number of unique Z3 queries (#AskZ3-u).


Table 4. Comparison with PAT via verifying Fischer's mutual exclusion algorithm

Observations: i) automata-based model checkers (both PAT and Uppaal) are vastly more efficient when given concrete values for the constants d and e; however, ii) our proposal is able to symbolically prove the algorithm given only the constraints over d and e, which cannot be achieved by existing model checkers; and iii) our verification time largely depends on the number of Z3 queries, which is optimized in our implementation by keeping a table of already queried constraints.

6.3 Case Study: Prove It When Reoccurring. Termination of the TRS is guaranteed because the set of derivatives to be considered is finite, and possible cycles are detected using memoization [14], as demonstrated in Table 5. In step ○2, in order to eliminate the first event B, A*#tR has to be reduced to ε, so the RHS time constraint is strengthened to tR=0. Looking at the sub-tree (I), in step ○5, tL and tR are split into tL₁+tL₂ and tR₁+tR₂. Then in step ○6, A#tL₁ and A#tR₁ are eliminated, unifying tL₁ and tR₁ by adding the side constraint tL₁=tR₁. In step ○8, we observe that the proposition is isomorphic to one from a previous step, marked with (‡). Hence we apply the rule [REOCCUR] to prove it, with a successful side-constraints entailment.

Table 5. The reoccurrence proving example. (I) is the left hand side sub-tree of the main rewriting proof tree.

Main proof tree (read bottom-up):

○1 [OR-LHS]   (tL<3 ∧ (A*#tL)·B) ∨ (True ∧ B) ⊑ tR<4 ∧ (A*#tR)·B, split into sub-tree (I) and the branch below
○2 [UNFOLD]   True ∧ B ⊑ tR<4 ∧ (A*#tR)·B, where eliminating B reduces A*#tR to ε, strengthening the RHS constraint to tR=0
○3 [NORMAL]   True ∧ B ⊑ tR=0 ∧ ε·B, with both B events eliminated
○4 [PROVE]    True ∧ ε ⊑ tR=0 ∧ ε

Sub-tree (I), read bottom-up:

○5 [SPLIT]    tL<3 ∧ (A*#tL)·B ⊑ tR<4 ∧ (A*#tR)·B (‡), with tL₁+tL₂=tL ∧ tR₁+tR₂=tR
○6 [UNFOLD]   tL<3 ∧ (A#tL₁ · A*#tL₂)·B ⊑ tR<4 ∧ (A#tR₁ · A*#tR₂)·B, with π_u: tL₁=tR₁
○7 [UNFOLD]   A#tL₁ and A#tR₁ are eliminated
○8 [REOCCUR]  tL<3 ∧ (A*#tL₂)·B ⊑ tR<4 ∧ (A*#tR₂)·B (‡), with side condition tL<3 ∧ tL₁+tL₂=tL ∧ tR=tR₁+tR₂ ∧ tL₁=tR₁ ∧ tL₂=tR₂ ⇒ tR<4

6.4 Discussion. Our implementation is the first to prove inclusions between symbolic TAs, which is significant because it overcomes the following main limitations of traditional timed model checking: i) TAs cannot be used to specify/verify incompletely specified systems (i.e., those whose timing constants are not yet known) and hence cannot be used in early design phases; ii) verifying a system with a set of timing constants usually requires enumerating all of them when they are integer-valued; and iii) TAs cannot be used to verify systems whose timing constants range over a real-valued dense interval.

## 7 Related Work

7.1 Verification Framework. This work is most similar to [20], which also deploys a forward verifier and a TRS for extended regular expressions. The differences are: i) [20] targets general-purpose sequential programs without shared variables, whereas this work targets time-critical programs in the presence of concurrency and global shared state; ii) the dependent values in [20] denote the number of repetitions of a trace, whereas in this work they abstract real-time bounds; and iii) in this work, the TRS supports inclusion checking for the block waiting operator π? and the concurrent composition ||. These are essential in timed verification (and, more generally, for distributed systems), and are not supported in [20] or any other TRS-related work.

7.2 Specifications and Real-Time Verification. Apart from compositional modelling for real-time systems based on timed process algebras, such as Timed CSP [8] and CCS+Time [21], there have been a number of translation-based approaches to building verification support for timed process algebras. For example, in [8], Timed CSP is translated to timed automata (TAs) so that the model checker Uppaal [3] can be applied. However, all the translation-based approaches share a common problem: the overhead introduced by the complex translation makes them particularly inefficient when disproving properties. We are of the opinion that the goal of verifying real-time systems, in particular safety-critical systems, is to check logical temporal properties, which can be done without constructing the whole reachability graph or the full power of model checking. We consider our approach simpler, as it is based directly on constraint-solving techniques, and it can be fairly efficient when verifying systems consisting of many components, as it avoids exploring the whole state space [20,22].

This work draws similarities to Real-Time Maude [23], which complements timed automata with more expressive object-oriented specifications.

7.3 Clock Manipulation and Zone-Based Bisimulation. The concept of implicit clocks has also been used in time Petri nets and implemented in several model checking engines, e.g., [24]. On the other hand, to make model checking with explicit clocks more efficient, [25,26,27,28] work on dynamically deleting or merging clocks. Our work also draws connections with region/zone-based bisimulations [29], which are broadly used in reasoning about timed automata.

## 8 Conclusion

This work provides an alternative approach for verifying real-time systems, where temporal behaviors are reasoned about at the source level, and the specification expressiveness goes beyond traditional timed automata. We define the novel effects logic TimEffs to capture real-time behavioral patterns and temporal properties. We demonstrate how to build an axiomatic semantics (or rather an effects system) for Cᵗ via timed-trace processing functions. We use this semantic model to enable a Hoare-style forward verifier, which computes the program effects constructively. We present an effects inclusion checker, the TRS, to efficiently prove the annotated temporal properties. We prototype the verification system and show its feasibility. To the best of our knowledge, our work proposes the first algebraic TRS for solving inclusion relations between timed specifications.

Limitations and Future Work. Our TRS is incomplete, meaning there exist valid inclusions that will be disproved by our system. This is mainly due to insufficient unification, a trade-off made in favour of automation. We also foresee the possibility of adding other logics to our existing trace-based temporal logic, such as separation logic for verifying heap-manipulating distributed programs.

## 9 Acknowledgements

The authors would like to thank anonymous reviewers for their comments. This work was partially supported by a Singapore Ministry of Education (MoE) Tier 3 grant "Automated Program Repair", MOET32021-0001.

## References


Proceedings, ser. Lecture Notes in Computer Science, B. Beckert, Ed., vol. 3702. Springer, 2005, pp. 78–92. [Online]. Available: https://doi.org/10.1007/11554554 8


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## Parameterized Verification under TSO with Data Types

Parosh Aziz Abdulla<sup>1</sup> , Mohamad Faouzi Atig<sup>1</sup> , Florian Furbach1() , Adwait A. Godbole<sup>3</sup> , Yacoub G. Hendi<sup>1</sup> , Shankara N. Krishna<sup>2</sup> , and Stephan Spengler<sup>1</sup>

> <sup>1</sup> Uppsala University, Uppsala, Sweden florian.furbach@it.uu.se 2 Indian Institute of Technology Bombay, Mumbai, India <sup>3</sup> UC Berkeley, Berkeley, USA

We consider parameterized verification of systems executing according to the total store ordering (TSO) semantics. The processes manipulate abstract data types over potentially infinite domains. We present a framework that translates the reachability problem for such systems to the reachability problem for register machines enriched with the given abstract data type. We use the translation to obtain tight complexity bounds for TSO-based parameterized verification over several abstract data types, such as push-down automata, ordered multi-pushdown automata, one-counter nets, one-counter automata, and Petri nets. We apply the framework to obtain complexity bounds for higher-order stack and counter variants as well.

## 1 Introduction

A parameterized system consists of a fixed but arbitrary number of identical processes that execute in parallel. The goal of parameterized verification is to prove the correctness of the system regardless of the number of processes. Examples of such systems are sensor networks, leader election protocols, and mutual exclusion protocols. The topic has been the subject of intensive research for more than three decades (see e.g. [10,32,13,6]) and is the subject of a chapter of the Handbook of Model Checking [8]. Research on parameterized verification has mostly been conducted under the premise that (i) the processes run according to the classical Sequential Consistency (SC) semantics, and (ii) the processes are finite-state machines.

Under SC, the processes operate on a set of shared variables through which they communicate atomically, i.e., read and write operations take effect immediately. In particular, a write operation is visible to all the processes as soon as the writing process carries out its operation. Therefore, the processes always maintain a uniform view of the shared memory: they all see the latest value written on any given variable, hence we can interpret program runs as interleavings of sequential process executions. Although SC has been immensely popular as an intuitive way of understanding the behaviours of concurrent processes, it is not realistic to assume computation platforms guarantee SC anymore. The reason is that, due to hardware and compiler optimizations, most modern platforms

c The Author(s) 2023

allow more relaxed program behaviours than those permitted under SC, leading to so-called weak memory models. Weakly consistent platforms are found at all levels of system design, such as multiprocessor architectures (e.g., [48,47]), cache protocols (e.g., [46,21]), language-level concurrency (e.g., [41]), and distributed data stores (e.g., [17]). Therefore, in recent years, research on the parameterized verification of concurrent programs under weak memory models has started to become popular. Notable examples are the TSO semantics [4] and the Release-Acquire semantics of C11 [39].

In a parallel development, several works have extended the basic model of parameterized systems (under the SC semantics) by considering processes that are infinite-state systems. The most dominant such class has been the case where the individual processes are variants of push-down automata [36,33,28,28,40,42,30].

Parameterized verification is difficult, even under the original assumption of both SC and finite-state processes as we still need to handle an infinite state space. The extension to weakly consistent systems is even more complex due to the intricate extra process behaviours. Almost all weak memory models induce infinite state spaces even without parameterization and even when the program itself is finite-state. Therefore, performing parameterized verification under weak consistency requires handling a state space that is infinite in two dimensions; one due to parameterization and one due to the weak memory model. The same applies to the extension of parameterized verification under SC where the processes are infinite-state: in addition to infiniteness due to parameterization, we have a second source of infinity due to the infiniteness of the processes.

In this paper, we combine the above two extensions. We study parameterized verification of programs under the TSO semantics, where the processes use infinite data structures such as stacks and counters. The framework is uniform in that the manipulation can be described using an abstract data type.

We revisit the pivot abstraction technique presented in [4]. As a first contribution, we show that we can capture pivot abstraction precisely, using a class of register machines in which the registers assume values over a finite domain. We show that, for any given abstract data type A, we can reduce, in polynomial time, the parameterized verification problem under TSO and A to the reachability problem for register machines manipulating A. Furthermore, we show that the reduction also holds in the other direction: the reachability problem for register machines over A is polynomial-time reducible to the parameterized verification problem under TSO for A. In particular, the model abstracts away the semantics of TSO (in fact, it abstracts away concurrency altogether) since we are dealing with a single register machine.

We summarize the contributions of the paper as follows:


show the problem is PSpace-complete when A is a one-counter, ExpTimecomplete if A is a stack, 2-ETime-complete if A is an ordered multi stack, and ExpSpace-complete if A is a Petri net. We obtain further complexity bounds for higher order counter and stacks.

Related Work. There has been extensive research on parameterized verification since the 1980s (see [13,8] for recent surveys of the field). Early works showed the undecidability of the general problem (even assuming finite-state processes) [10], and hence the emphasis has been on finding useful special cases. Such cases are characterized by three aspects, namely the system topology (unordered, arrays, trees, graphs, rings, etc.), the allowed communication patterns (shared memory, rendez-vous, broadcast, lossy channels, etc.), and the process types (anonymous, with IDs, with priorities, etc.) [27,20,31,24,23,43].

Another line of research to counter undecidability is over-approximation, based on regular model checking [38,14,16,1], monotonic abstraction [5], and symmetry reduction [37,22,7].

A seminal work in the area is the paper by German and Sistla [32]. The authors consider the verification of systems consisting of an arbitrary number of finite-state processes interacting through rendez-vous communication. The paper shows that the model checking problem is ExpSpace-complete. In a series of more recent papers, parameterized verification has been considered in the case where the individual processes are push-down automata [36,33,28,40,42,30]. All the above works assume the SC semantics.

Due to the relevance of weak memory models in parameterized verification, papers on the topic have started to appear in the last two years. The paper [4] considers parameterized verification of programs running under TSO, and shows that the reachability problem is PSpace-complete. However, the paper assumes that the processes are finite-state and, in particular, the processes do not manipulate unbounded data domains. The model of the paper corresponds to the particular case of our framework where we take the abstract data type to be empty. In this case our framework also implies PSpace-completeness.

The paper [39] shows PSpace-completeness when the underlying semantics is the Release-Acquire fragment of C11, which gives rise to behaviours different from those of TSO. That paper also considers finite-state processes.

The paper [2] considers parameterized verification of programs running under TSO. However, the paper applies the framework of well-structured systems, where the buffers of the processes are modeled as lossy channels, and hence the complexity of the algorithm is non-primitive recursive. In particular, the paper does not give any complexity bounds for the reachability problem (or any other verification problem). Conchon et al. [19] address the parameterized verification of programs under TSO as well. They make use of Model Checking Modulo Theories; no decidability or complexity results are given. The paper [15] considers checking the robustness property against SC for parameterized systems running under the TSO semantics. However, the robustness problem is entirely different from reachability, and the techniques and results developed in that paper cannot be applied in our setting. The paper shows that the problem is ExpSpace-hard. All these works assume finite-state processes.

In contrast to all the above works, the current paper is the first to study decidability and complexity of parameterized verification under the TSO semantics when the individual processes are infinite-state.

## 2 Preliminaries

We denote a function f between sets A and B by f : A −→ B. We write f[a ← b] to denote the function f′ such that f′(a) = b and f′(x) = f(x) for all x ≠ a.

For a finite set A, we use |A| to refer to the size of A. We also use A∗ to denote the set of words over A, including the empty word ε. For a word w ∈ A∗, we use |w| to refer to the length of w. We say a word w is differentiated if all symbols in w are pairwise different. The set Adiff is the set of all differentiated words over the set A. Finally, for a differentiated word w, we define pos(w)(a) as the unique position of the letter a in w.

A labelled transition system is a tuple ⟨C, Cinit, Labs, −→⟩, where C is the set of configurations, Cinit ⊆ C is the set of initial configurations, Labs is a finite set of labels, and −→ ⊆ C × Labs × C is the transition relation over the set of configurations. For a transition ⟨c1, lab, c2⟩ ∈ −→, we usually write c1 −lab→ c2 instead. We use c1 −→ c2 to denote that c1 −lab→ c2 for some lab ∈ Labs. Furthermore, we write −∗→ to denote the reflexive transitive closure of −→, and if c1 −∗→ c2 then we say c2 is reachable from c1. If c1 ∈ Cinit, then we just say that c2 is reachable. A run ρ is an alternating sequence of configurations and labels, expressed as follows: c0 −lab1→ c1 −lab2→ c2 . . . cn−1 −labn→ cn. Given ρ, we write c0 −n→ cn, meaning that cn is reachable from c0 in n steps, and we write c0 −ρ→ cn, meaning that cn is reachable from c0 through the run ρ.

## 3 Abstract Data Types (ADT)

In this section, we introduce the notion of abstract data types (ADTs), which will be used extensively in the paper. An ADT is a labelled transition system A = ⟨Vals, {valinit}, Ops, −→A⟩. Intuitively, it describes the behaviour of some data type such as a stack or a counter. Vals is the set of configurations of A; it describes the possible values the data type can assume. The initial configuration is valinit ∈ Vals. The set of labels Ops represents the operations that can be executed on the data type, and the transition relation −→A ⊆ Vals × Ops × Vals describes the semantics of these operations. Below, we give some concrete examples of abstract data types.

Example 1 (Counter). We define a counter, denoted by the ADT Ct, as follows. The set of configurations is ValsCt = N, the natural numbers. The initial value valCt_init is 0. The set of operations is OpsCt = {inc, dec, isZero}. The transition relation −→Ct is as follows: the operations inc and dec increase or decrease the value of the counter by one, respectively. The latter operation is only enabled if the value of the counter is non-zero; otherwise it blocks. Finally, the transition isZero checks that the value of the counter is zero, i.e., it is only enabled if that condition holds.
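As an illustration, the counter ADT can be viewed operationally as a partial successor function on values. The following is a minimal sketch in Python (our own encoding; the function name `ct_step` and the `None`-for-blocked convention are not from the paper):

```python
# Sketch of the counter ADT Ct as a labelled transition system:
# values are natural numbers, the initial value is 0, and a blocked
# operation is modelled by returning None.

def ct_step(val, op):
    """Return the successor value for operation `op`, or None if blocked."""
    if op == "inc":
        return val + 1
    if op == "dec":
        return val - 1 if val > 0 else None   # dec blocks on zero
    if op == "isZero":
        return val if val == 0 else None      # zero test blocks otherwise
    raise ValueError(f"unknown operation {op!r}")
```

For example, `ct_step(0, "dec")` yields `None`, reflecting that dec is disabled on the value 0.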

Example 2 (Weak Counter). A weak counter differs from a counter in that it cannot be checked for zero. The ADT wCt representing a weak counter is defined as in Example 1, except the operations of wCt are reduced to OpswCt = {inc, dec}.

Example 3 (Stack). Let Γ be a finite set representing the stack alphabet. A stack St = ⟨ValsSt, {valSt_init}, OpsSt, −→St⟩ over Γ is defined as follows. The configurations of St are ValsSt = Γ∗ and the initial configuration is the empty stack valSt_init = ε. The set of operations is OpsSt = {pop(γ), push(γ), isEmpty | γ ∈ Γ}. The transition relation is as follows. For every word w ∈ Γ∗ and every symbol γ ∈ Γ, push(γ) adds the symbol γ to the top of the stack. Similarly, the pop(γ) operation removes the topmost symbol from the stack; it is only enabled if γ is the topmost symbol on the stack. The isEmpty operation does not change the stack, but can only be performed if the stack is the empty word ε.
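In the same style as the counter sketch, the stack ADT can be written as a partial step function on words over Γ. This is our own encoding (tuples with the top of the stack at the end; `st_step` is not a name from the paper):

```python
# Sketch of the stack ADT St over an alphabet Γ; a value is a tuple of
# symbols with the top of the stack at the end. Blocked operations
# return None.

def st_step(val, op, gamma=None):
    """Successor stack for `op` in {push, pop, isEmpty}, or None if blocked."""
    if op == "push":
        return val + (gamma,)
    if op == "pop":
        # pop(γ) is enabled only if γ is the topmost symbol
        return val[:-1] if val and val[-1] == gamma else None
    if op == "isEmpty":
        return val if val == () else None
    raise ValueError(f"unknown operation {op!r}")
```

For instance, `st_step(("a",), "pop", "b")` is `None`, since "b" is not the topmost symbol.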

Example 4 (Petri Nets). Given a Petri net [44], we can define a corresponding ADT Petri that models its semantics. The values are the markings, the operations are the Petri net transitions, and the transition relation is given by the input and output vectors of the Petri net transitions.

Higher Order ADTs. We extend the ADT St to higher order stacks, referred to as n-St. This is done recursively [18,25]; the formal definition is in the full version of our paper [3]. A value of a level n higher order stack n-St is a stack of level n − 1 stacks. For level 1, it is the standard stack St. The operations for level n are Opsn-St = {pop(γ), push(γ), popk, pushk | γ ∈ Γ, 2 ≤ k ≤ n}. The operations pop(γ) and push(γ) are recursively applied to the top element of the stack (which is a stack that is one level lower) until the level of the top element is 1; there, they have the standard stack behaviour. The operations popk and pushk are recursively applied to the top element until the level of the element is k. Then, pushk pushes a copy of this level k stack on top of the original, while popk removes it.

Since a counter can be seen as a stack with an alphabet of size 1 (and a bottom element ⊥), we can extend the definitions of wCt and Ct to n-wCt and n-Ct in the same way. We add operations inck and deck. All operations are recursively applied to the top counter. For inc, dec, and isZero, we use the standard behaviour once the level is 1. For inck and deck, we copy resp. remove the top element once the level is k.

Example 5 (Ordered Multi Stack). We extend the stack to a numbered list of n stacks, n-OMSt [12]. A value of n-OMSt consists of a list of stacks valSt_1 . . . valSt_n. An operation in Opsn-OMSt = {isEmptyi, popi(γ), pushi(γ) | γ ∈ Γ, i ≤ n} works on stack number i in the standard way. One additional condition is that the stacks have to be ordered, meaning an operation popi(γ) is only enabled if the stacks 1 . . . i − 1 are empty.
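The ordering condition can be made concrete with a small sketch (our own representation: a value is a tuple of per-stack tuples, indexed from 1; `omst_pop` is a name we introduce for illustration):

```python
# Sketch of the ordering condition of n-OMSt: pop_i(γ) is enabled only
# if stacks 1..i-1 are empty and γ is on top of stack i.

def omst_pop(stacks, i, gamma):
    """Pop γ from stack number i (1-based), or return None if blocked."""
    if any(stacks[j] for j in range(i - 1)):      # some lower stack non-empty
        return None
    if not stacks[i - 1] or stacks[i - 1][-1] != gamma:
        return None                               # γ not on top of stack i
    out = list(stacks)
    out[i - 1] = out[i - 1][:-1]
    return tuple(out)
```

So popping from stack 2 blocks as long as stack 1 still holds a symbol.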

## 4 TSO with an Abstract Data Type: TSO(A)

In this section, we introduce concurrent programs running under TSO(A) for an ADT A = ⟨Vals, {valinit}, Ops, −→A⟩. These programs consist of concurrent processes where the communication between processes is performed using shared memory under the TSO semantics. In addition, each process maintains a local variable of type A.

Syntax of TSO(A). Let Dom be a finite data domain and Vars be a finite set of shared variables over Dom. Let dinit ∈ Dom be the initial value of the variables. We define the instruction set of TSO(A) as Instrs = {rd(x, d), wr(x, d) | x ∈ Vars, d ∈ Dom} ∪ {skip, mf}, which are called read, write, skip and memory fence, respectively.

A process is represented by a finite state transition system. It is given by the tuple Proc = ⟨Q, qinit, δ⟩, where Q is a finite set of states, qinit ∈ Q is the initial state, and δ ⊆ Q × (Instrs ∪ Ops) × Q is the transition relation. We call this tuple the description of the process. A concurrent program is a tuple of processes P = ⟨Procι⟩ι∈I, where I is some finite set of process identifiers. For each ι ∈ I we have Procι = ⟨Qι, q^ι_init, δι⟩.

Semantics of TSO(A). We describe the semantics of a program P running under TSO(A) by a labelled transition system T^P = ⟨C^P, C^P_init, Labs^P, −→^P⟩. The formal definition is given in [3]. Under TSO(A), there is an unbounded FIFO buffer of writes between each process and the memory. A configuration c ∈ C^P of the system consists of the value of each variable in the shared memory as well as, for each process, its local state, the value of its ADT, and the content of the corresponding write buffer.

The labelled transitions −→^P are as follows: A local skip transition simply updates the state of the corresponding process. An ADT operation additionally updates the ADT value according to the ADT behaviour −→A. When a process executes a write instruction, the operation is enqueued as a pending write message into its buffer. A message msg is an assignment of the form msg = ⟨x, d⟩, where x ∈ Vars and d ∈ Dom. We denote the set of all messages by Msgs = Vars × Dom. The buffer content for a process is given as a word over Msgs. The messages inside each buffer are moved non-deterministically to the main memory in a FIFO manner. Once a message reaches the memory, it becomes visible to all the other processes. When executing a read instruction on a variable x ∈ Vars, the process first checks its buffer for pending write messages on x. If the buffer contains such a message, then it reads the value of the most recent one. If the buffer contains no write messages on x, then the process fetches the value of x from the memory. The initial configuration is c^P_init, where each process is in its initial state, each ADT holds its initial value, each store buffer is empty, and the memory holds the initial values of all variables. Note that since the FIFO buffers are unbounded, this is an infinite state transition system, even for a finite ADT.
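The buffer discipline described above can be sketched in a few lines of runnable code. This is our own simplified encoding of the per-process steps (single process, no ADT), not the paper's formal definition:

```python
# Sketch of the TSO buffer discipline: writes are enqueued into a FIFO
# buffer, updates dequeue the oldest message into shared memory, and a
# read prefers the most recent buffered write on the same variable.

from collections import deque

def write(buf, x, d):
    buf.append((x, d))                 # enqueue pending write message

def update(buf, mem):
    x, d = buf.popleft()               # oldest message hits the memory
    mem[x] = d

def read(buf, mem, x):
    for y, d in reversed(buf):         # most recent own write on x, if any
        if y == x:
            return d
    return mem[x]                      # otherwise fetch from memory
```

Note the read-own-write effect: after `write(buf, "x", 1)` the process reads 1 for x even though the shared memory still holds the initial value until `update` is performed.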

A sequence of transitions c0 −lab1→P c1 −lab2→P c2 . . . cn−1 −labn→P cn, where c0 = c^P_init is the initial configuration and labi ∈ Labs^P, is called a run in the TSO(A) transition system. If there is a run ending in a configuration with state qfinal, then we say qfinal is reachable by Proc under TSO(A).

## 5 Parameterized Reachability in TSO(A)

In this section, we consider the parameterized TSO setting which allows for an a priori unbounded number of processes with the same process description. We begin by formally introducing the parameterized state reachability problem, and then develop a generic construction that allows us to represent the TSO semantics (except for the ADT) in a finite manner.

The Parameterized State Reachability Problem. Intuitively, parameterization allows for an arbitrary number of identical processes. The parameterized state reachability problem for TSO(A), called TSO(A)-P-Reach, identifies a family of (standard) reachability problem instances; we want to determine whether reachability holds in some member of the family. We now introduce this formally.

For a given process description Proc, we consider the program instance P^n_Proc, parameterized by a natural number n, as follows. For I = {1, . . . , n}, let P^n_Proc = ⟨Proc1, . . . , Procn⟩ with Procι = Proc for all ι ∈ I. That is, the n-th slice of the parameterized family of programs contains n processes, all with identical description Proc. We require that all processes maintain copies of the ADT A.

TSO(A)-P-Reach: Given: A process Proc = ⟨Q, qinit, δ⟩, an ADT A, and a state qfinal ∈ Q. Decide: Is there an n ∈ N s.t. qfinal is reachable by P^n_Proc under TSO(A)?

When talking about a certain family of ADTs, e.g. the family of Petri nets, we write TSO(Petri)-P-Reach to mean the restriction of TSO(A)-P-Reach to Petri nets, i.e. to instances where A is a Petri net.

The main difference between the non-parameterized and the parameterized case of the problem is that in the first case the index set I is fixed a priori, while in the second case it can be arbitrary. This results in C^P_init being a singleton in the non-parameterized case, while it becomes infinite (one initial configuration for each n-slice) in the parameterized case.

We determine upper and lower bounds for the complexity of the state reachability problem. The challenge of solving this problem varies with the ADT. This problem for plain TSO without an ADT has been studied in [4]. They showed that the problem can be decided in PSpace and is in fact PSpace-complete. The result is based on an abstraction technique called the pivot semantics. The pivot semantics is exact in the sense that a state q is reachable under parameterized TSO if and only if it is reachable under the pivot semantics.

We show that the dynamics underlying the pivot abstraction can be generalized to our model with an ADT: the pivot abstraction can be extended to obtain a register machine. We use this construction to give a general characterization of TSO(A)-P-Reach. First, we recall the pivot abstraction.

The Pivot Abstraction [4]. For a set of variables Vars and data domain Dom, processes generate pending write messages from the set Msgs = Vars × Dom by executing wr instructions. This set has size |Vars| · |Dom|, and hence at most as many distinct (variable, value) pairs can be produced in any run. For a run ρ of the program and each message msg = ⟨x, d⟩ ∈ Msgs, we can define the first point along ρ at which some write on variable x with value d is propagated to the memory. The pivot abstraction identifies these points as pivot points: for a write message msg occurring in ρ, the pivot point pvt(msg) is the first point at which msg is propagated to the memory in ρ.

The core observation is that if at some point in ρ, a process Procι propagates a message msg = ⟨x, d⟩ from its buffer to the memory, then after that point, the value d will always be available to read on variable x from the shared memory. Technically, this follows from parameterization. There are arbitrarily many processes executing identical descriptions. This means transitions of the original process Procι can be mimicked by a clone process Procι′ identical to Procι. Hence, Procι′ can replicate the execution of Procι right up to the point where the message msg is the oldest message in its buffer. Then a single propagate step updates the value of x in the shared memory to d. There can be arbitrarily many such clones, and the propagate step can happen at any time. It follows that beyond the point pvt(msg) in ρ, the value d can always be read from x.

For distinct messages from Msgs, we can order the pivot points corresponding to these messages according to the order in which they appear in ρ. This gives us a first update sequence, denoted by ω. No two messages in ω are the same; the set of such sequences is the set of differentiated words Msgsdiff. A message msg ∈ Msgs in ω has rank k if it is the k-th pivot point in ω.
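Extracting ω from a run amounts to keeping the first propagation of each distinct message, in order. A small sketch (our own encoding: the propagation order of a run is given as a plain list of (variable, value) pairs; the function names are ours):

```python
# Sketch: the first update sequence ω of a run is the subsequence of
# first propagations of each distinct message, in propagation order.

def first_update_sequence(propagations):
    """Keep only the first propagation of each distinct message."""
    omega, seen = [], set()
    for msg in propagations:
        if msg not in seen:
            seen.add(msg)
            omega.append(msg)
    return omega

def rank(omega, msg):
    """1-based rank of msg in ω, i.e. the index of its pivot point."""
    return omega.index(msg) + 1
```

Since ω contains each message at most once, it is a differentiated word over Msgs, matching the definition above.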

Providers. The pivot abstraction simulates a run ρ under the TSO semantics by running abstract processes called providers in a sequential manner. For 1 ≤ k ≤ |ω| + 1, the k-provider simulates the process that generates the write of the rank k message ⟨x, d⟩ corresponding to the k-th pivot point in ρ. The k-provider completes its task when it has simulated this process up to the point where it generates ⟨x, d⟩. At this point, it invokes the (k + 1)-provider. With this background, we now develop the formal pivot semantics for parameterized TSO(A).

Formal Pivot Semantics for Parameterized TSO(A). We define the formal operational semantics of the pivot abstraction as a labelled transition system. Given a process description Proc = ⟨Q, qinit, δ⟩ and an ADT A = ⟨Vals, {valinit}, Ops, −→A⟩, a configuration of the pivot transition system represents the view of a provider when simulating a run of the program. A view v = ⟨q, val, Lw, ω, φE, φL, φP⟩ is defined as follows. The process state is given by q ∈ Q. The value of the provider's ADT A is val ∈ Vals. The function Lw : Vars −→ Dom ∪ {⊥} gives, for each x ∈ Vars, the value of the latest (i.e., most recent) write the provider has performed on x. If no such instruction exists (the process has made no writes to x), then Lw(x) = ⊥. Note that Lw abstracts the buffer in terms of read-own-write operations, since the process can only read from the most recent pending write in its buffer on each variable (if it exists). We define Lw⊥ such that Lw⊥(x) = ⊥ for all x ∈ Vars. The first update sequence of pivot messages is ω ∈ Msgsdiff. It is unchanged by transitions and remains constant throughout the pivot run.

The external pointer φE ∈ {0, 1, . . . , |ω|} helps the provider keep track of which messages from ω it has observed. These messages have been propagated by other processes. The external pointer is used to identify which variables are still holding their initial values in the memory. If the provider observes an external write on a variable x (by accessing the memory), then this write has overwritten the initial value of x in the memory. The local pointer φL : Vars −→ {0, 1, . . . , |ω|} is a set of pointers, one for each variable x ∈ Vars. The function φL(x) gives the highest ranked write operation the provider itself has performed (on any variable) before it performed the latest write on x. The local pointer is necessary to know which variables lose their initial values when we need to empty the buffer. In other words, the local pointer abstracts the buffer in terms of update operations. We define φL^max := max{φL(x) | x ∈ Vars} as the highest value of a local pointer, and φL^0 such that φL^0(x) = 0 for all variables x ∈ Vars, i.e., all pointers are in the leftmost position. The progress pointer φP ∈ {1, 2, . . . , |ω| + 1} gives the rank of the process the current provider is simulating.


Fig. 1: The transition relation of the pivot semantics for a process Proc.

Given an update sequence ω ∈ Msgsdiff and 1 ≤ k ≤ |ω| + 1, we define the initial view induced by ω and k, denoted by vinit(ω, k), as the view ⟨qinit, valinit, Lw⊥, ω, 0, φL^0, k⟩. For a given ω, the k-provider starts with vinit(ω, k): Lw⊥ and φL^0 imply that the simulated process has not performed any writes, and φE = 0 means that it has not read/updated from/to the memory.

We define the labeled transition relation −→pvt on the set of views by the inference rules given in Figure 1. The set of labels is Instrs ∪ Ops. We describe the inference rules briefly. The skip rule only changes the local state of the process. There are two inference rules, write(1) and write(2), describing the execution of a write operation wr(x, d). The rule write(1) describes the situation where the rank of ⟨x, d⟩ is strictly smaller than the progress pointer φP. In this case, we update both Lw and φL. The rule write(2) describes the situation where the rank of ⟨x, d⟩ equals the progress pointer. This means that the provider has provided the message ⟨x, d⟩ with rank φP. Hence it has completed its mission, and it initiates the next provider by transitioning to vinit(ω, φP + 1).

There are three inference rules describing a read operation rd(x, d). The rule read(1) describes the case where the last value written to x by the provider is d, i.e., Lw(x) = d. In this case, the provider simply reads from its local buffer. The rule read(2) describes the read of an initial value. It ensures that the read is possible by checking that no write operation on x has been executed by the provider (Lw(x) = ⊥), and by checking that the initial value of the variable has not been overwritten in the memory. This is achieved by checking that the position of ⟨x, d⟩ in ω, i.e. pos(ω)(⟨x, d⟩), is strictly larger than φE. The rule read(3) describes the case where the simulated process reads from the memory. It checks that the message ⟨x, d⟩ has been generated by some previous provider (pos(ω)(⟨x, d⟩) < φP), and then it updates the external pointer to max(φE, φL(x), pos(ω)(⟨x, d⟩)). The memory fence rule describes the case where the simulated process performs a fence action. The rule updates the external pointer to max(φE, φL^max). Finally, the data-operation rule describes the case where the simulated process performs an ADT operation.

The set of initial views is Vinit = {vinit(ω, 1) | ω ∈ Msgsdiff}. This is the set of initial views of the 1-provider; it is finite because Msgsdiff is finite, unlike the set of initial configurations Cinit in the parameterized case under TSO.

## 6 Register Machines

Our goal is to design a general method to determine the decidability and complexity of TSO(A)-P-Reach depending on A. We examine the pivot abstraction introduced in the previous section. A view v = ⟨q, val, Lw, ω, φE, φL, φP⟩ of the pivot transition system can be partitioned into two components: (1) q, Lw, ω, φE, φL, φP, which contains the local state and also effectively abstracts the unbounded FIFO buffers and shared memory of the TSO system, and (2) val, which captures the value of the ADT. The first part is finite, since each component takes finitely many values. We call it the book-keeping state, since it keeps track of the progress of the core TSO system. The ADT part, however, can be infinite, depending on the abstract data type.

We will use a register machine in order to represent the book-keeping state in a finite way using states and registers. On the other hand, we will keep the ADT component general and only later instantiate it to some interesting cases.

A register machine is a finite state automaton that has access to a finite set of registers, each holding a natural number. The register machine can execute two operations on a register: it can write a given value, or it can read a given value; a read blocks if the given value is not in the register. We differ from most definitions of register machines in two significant ways. First, since we only require a finite domain to model the TSO(A) semantics, the values of the registers are bounded from above by some N ∈ N. This makes the set of register assignments finite, whereas most definitions allow an unbounded domain. Second, our register machine is augmented with an ADT.

Given an ADT A = ⟨Vals, {valinit}, Ops, −→A⟩, let Regs be a finite set of registers and Dom = {0, . . . , N} their domain. We define the set of actions Acts = {SKP, WRITE(r, d), READ(r, d) | r ∈ Regs, d ∈ Dom}. A register machine is then defined as a tuple R(A) = ⟨Q, qinit, δ⟩, where Q is a finite set of states, qinit ∈ Q is the initial state, and δ ⊆ Q × (Acts ∪ Ops) × Q is the transition relation.

The semantics of the register machine is given in terms of a transition system. The set of configurations is Q × Dom^Regs × Vals: a configuration consists of a state, a register assignment Regs −→ Dom, and a value of A. The initial configuration is ⟨qinit, 0^Regs, valinit⟩, where all registers contain the value 0.

The transition relation −→ is as follows. SKP only changes the local state, not the registers or the ADT value. WRITE(r, d) sets the value of register r to d. READ(r, d) is only enabled if the value of r is d; it does not change the value. The operations in Ops work as usual and do not change any register. We define the state reachability problem for register machines, R(A)-Reach, in the usual way: a state qfinal ∈ Q is reachable if there is a run of the transition system defined by the semantics of R(A) that starts in the initial configuration and ends in a configuration with state qfinal.
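When the ADT component is finite, the configuration space Q × Dom^Regs × Vals is finite and R(A)-Reach can be decided by explicit-state search. The following is a sketch under that assumption (our own transition encoding, not the paper's; in general Vals may be infinite and this search need not terminate):

```python
# Sketch of R(A)-Reach by explicit-state search over configurations
# (state, register assignment, ADT value), assuming finitely many
# reachable ADT values.

def reachable(q_init, regs, delta, adt_step, val_init, q_final):
    """delta: list of (q, action, q'); actions are ('SKP',),
    ('WRITE', r, d), ('READ', r, d) or ('OP', op);
    adt_step(val, op) returns the successor value or None if blocked."""
    init = (q_init, tuple(sorted(regs.items())), val_init)
    todo, seen = [init], {init}
    while todo:
        q, rs, val = todo.pop()
        if q == q_final:
            return True
        r = dict(rs)
        for src, act, dst in delta:
            if src != q:
                continue
            nr, nv = r, val
            if act[0] == "WRITE":
                nr = {**r, act[1]: act[2]}
            elif act[0] == "READ":
                if r[act[1]] != act[2]:
                    continue              # blocking read
            elif act[0] == "OP":
                nv = adt_step(val, act[1])
                if nv is None:
                    continue              # ADT operation blocked
            c = (dst, tuple(sorted(nr.items())), nv)
            if c not in seen:
                seen.add(c)
                todo.append(c)
    return False
```

For example, a machine that writes 1 into register r and then reads 1 reaches its final state, while a machine whose first action is a blocking READ of a value not in the register does not.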

## 6.1 Simulating Pivot Abstraction by Register Machines

In this section we show how to simulate the pivot abstraction by a register machine. The idea is to keep the book-keeping state (except for the local state) in the registers. Given a process description Proc = ⟨QProc, q^Proc_init, δProc⟩ for an ADT A, we construct a register machine R(A) = ⟨Q, qinit, δ⟩ that simulates the pivot semantics as follows. The set of registers is

Regs := {Lw(x), rkVars(x), rkMsgs(msg), φE, φL(x), φL^max, φP, rknxt | x ∈ Vars, msg ∈ Msgs}.

The registers rkVars(x) and rkMsgs(msg) hold the rank of each variable and message, respectively. This implicitly gives rise to an update sequence. The auxiliary register rknxt is used to initialize the other rank registers, as will be explained later on. The remaining registers correspond to their respective counterparts in the pivot abstraction. Note that the number of registers is linear in the number of messages |Msgs|. The domain of the registers is defined to be Dom = {0, . . . , |Msgs| + 1}. Since the TSO memory domain is finite, we can assume w.l.o.g. that the memory values are positive integers; if Lw(x) = 0, this means that there has been no write on x and it still holds the initial value. The set of states Q contains QProc ∪ {q^R(A)_init, q^ptr_init} as well as a number of (unnamed) auxiliary states that will be used in the following.

To simplify our construction, we use additional operations on registers besides WRITE and READ. We introduce blocking comparisons between registers and values such as ==, <, ≤, ≠, register assignments such as r := r′, and increments by one, denoted r++. A more detailed description of these instructions is given in [3].

The Initializer. The pivot semantics defines an exponential number of initial states: one per possible update sequence. The register machine instead guesses an update sequence at the start of the execution and stores it in the rank registers. This part of the register machine is the rank initializer (shown in Figure 2 (a)). It uses the auxiliary register rknxt to keep track of the next rank to be assigned. The rank initializer nondeterministically chooses a so far unranked message and assigns the next rank to it. If the variable of the message has no rank assigned yet, it updates the rank of the variable as well. Then it increases the rknxt register and continues. After each rank assignment, the initializer can choose to stop the rank assignment. In that case, it initializes the register φP to 1 and finishes in the initial state of Proc.

In addition to the rank initializer, we have the pointer initializer. It is responsible for resetting all pointers except the progress pointer to zero; the progress pointer is instead incremented by one. This initializer is not executed at the beginning of the simulation, but between epochs of the pivot abstraction.

The Simulator. The main part of the construction handles the simulation of the pivot abstraction. It contains QProc as well as several auxiliary states described in the following, and it simulates each instruction of TSO(A). The skip instruction and the data operations are carried out unchanged. A visualization of the remaining instructions is depicted in Figure 2. In the case of a write instruction wr(x, d), we first compare the rank of the write message with the progress pointer. If they are equal, the epoch is finished and the next process should start; therefore we jump to the first state of the pointer initializer. Otherwise, we set the last write register Lw(x) to d. Then, we ensure that φL^max is at least as large as the rank of ⟨x, d⟩, and finally we update the local pointer φL(x) to be equal to φL^max. For the memory fence instruction, it only needs to be ensured that the external pointer is at least as large as the maximum local pointer φL^max. For a read instruction rd(x, d), if the last write to x was of value d, we can execute the read directly. Otherwise, after checking that the read can be performed by the current provider, we ensure that the external pointer is at least as large as both the rank of ⟨x, d⟩ and the local pointer of x. For the special case d = dinit, there is an additional way in which the read can be performed: we can read dinit from the memory if the process has neither already written to x nor observed a write with rank higher than or equal to the rank of x. This gives us the following theorem, proven in Appendix C of the full version [3]:

Theorem 1. TSO(A)-P-Reach is polynomial time reducible to R(A)-Reach.

Fig. 2: The rank initializer and the simulator for some instructions instr.

## 6.2 Simulating Register Machines by TSO

We now show how to simulate an ADT register machine by a parameterized program running under TSO(A). The main idea is to keep the information about the registers in the last pending write operations, while making sure that not a single write operation actually hits the memory. Thus, the simulator always reads either the initial values or its own writes, never writes of other processes.

The TSO program has a variable for each register, and two additional variables xs and xc that act as flags: xs indicates that the verifier should start working, while xc indicates that the verifier has successfully completed the verification. At the beginning of the execution, each process nondeterministically chooses to be either a simulator, a scheduler, or a verifier. Each role is described in the following. The complete construction is shown in Appendix C of [3].

The simulator uses the same states and transitions as R(A), but instead of reading from and writing to registers, it uses the memory. If the simulator reaches the target state qtarget, it first checks the xs flag. If it is already set, the simulator stops, never reaching the final state qfinal. Otherwise, it waits until it observes the flag xc being set, and then enters the final state. The scheduler's only responsibility is to signal the start of the verification process. It does so by setting the flag xs at a nondeterministically chosen time during the execution of the program. The verifier waits until it observes the flag xs. It then starts the verification process, which consists of checking each variable that corresponds to a register. If all of them still contain their initial values, the verification was successful. The verifier signals this to the simulator process by setting the xc flag.

Any execution ending in qfinal must first perform a simulation of R(A) ending in qtarget; then a scheduler propagates the setting of the flag xs, and afterwards a verifier executes. This ensures that the initial values are read by the verifier after the register machine has been simulated, and thus that the shared memory is unchanged. This means the simulator only accessed its write buffer and never writes of other processes. It follows that qtarget is reachable by R(A) if and only if qfinal is reachable by Proc under TSO(A). This gives us the following result:

**Theorem 2.** *R(A)-Reach is polynomial-time reducible to TSO(A)-P-Reach.*

Theorem 1 and Theorem 2 give us a method for determining upper and lower bounds on the complexity of TSO(A)-P-Reach for different instantiations of the ADT. Since we have reductions in both directions, we can conclude that TSO(A)-P-Reach is decidable if and only if R(A)-Reach is decidable. We know that TSO(NoAdt)-P-Reach is PSpace-hard, where NoAdt is the trivial ADT that models plain TSO semantics [4]. From this we immediately derive a lower bound for any ADT A: TSO(A)-P-Reach is PSpace-hard.

## 7 Instantiations of ADTs

In the following, we instantiate our framework with a number of ADTs in order to show its applicability.

**Theorem 3.** *TSO(Ct)-P-Reach and TSO(wCt)-P-Reach are PSpace-complete.*

We know that TSO(A)-P-Reach is PSpace-hard for any ADT A, including Ct and wCt. Regarding the upper bound for Ct, we can show that R(Ct)-Reach can be polynomially reduced to R(NoAdt)-Reach. The idea is to show that a witness for R(Ct)-Reach only requires counter values up to a certain bound. This bound is polynomial in the number of possible states and register assignments (i.e., it is at most exponential in the size of R(Ct)). If a run contains a configuration c whose counter value exceeds the bound, then some state and register assignment must be repeated in the run with different counter values. We can use this to shorten the run such that the counter value in c is reduced.

We can encode the counter value (up to this bound) in binary, using registers that act as bits. The number of additional registers is polynomial in the size of R(Ct). In order to simulate an inc operation on this binary encoding using WRITE and READ, we only have to go through the bits, starting at the least significant bit, and flip them until one is flipped from 0 to 1. The dec operation works analogously. This only requires a polynomial state and transition overhead.
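The bit-flipping procedure for inc and dec can be sketched directly. In the actual reduction each flip is realized by a READ followed by a WRITE on the register holding that bit, but the control flow is the same:

```python
def inc(bits):
    # bits[0] is the least significant bit; flip bits until a 0 becomes 1.
    for i, b in enumerate(bits):
        if b == 0:
            bits[i] = 1
            return bits
        bits[i] = 0  # carry: a 1 flips back to 0, continue with the next bit
    raise OverflowError("counter value exceeded the bound")

def dec(bits):
    # Analogously: flip bits until a 1 becomes 0.
    for i, b in enumerate(bits):
        if b == 1:
            bits[i] = 0
            return bits
        bits[i] = 1  # borrow
    raise ValueError("cannot decrement zero")
```

For example, incrementing 3 = [1, 1, 0] flips two ones to zero and the final zero to one, yielding 4 = [0, 0, 1].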

We know that R(NoAdt)-Reach is in PSpace [4]. It follows from the polynomial reduction that R(Ct)-Reach is in PSpace. Applying Theorem 1 gives us that TSO(Ct)-P-Reach is in PSpace. Since any wCt is a Ct, it follows that TSO(wCt)-P-Reach is in PSpace as well. The full proof is in [3].

**Theorem 4.** *TSO(St)-P-Reach is ExpTime-complete.*

For membership, we encode the registers of R(St) in the states, which yields a finite state machine with access to a stack, i.e., a pushdown automaton. The construction has an exponential number of states. From [45], we have that checking the emptiness of a context-free language generated by a pushdown automaton is polynomial in the size of the automaton. Combined, we get that state reachability of the constructed pushdown automaton, and hence R(St)-Reach, is in ExpTime. By Theorem 1, TSO(St)-P-Reach is in ExpTime as well.

To prove the lower bound, we can reduce the problem of checking the emptiness of the intersection of a pushdown automaton with n finite-state automata [35], which is well known to be ExpTime-complete, to R(St)-Reach. The idea is to use the stack to simulate the pushdown automaton and n registers to keep track of the states of the finite-state automata. We apply Theorem 2 and get that TSO(St)-P-Reach is ExpTime-hard. The formal proof is in [3].

**Theorem 5.** *TSO(Petri)-P-Reach is ExpSpace-complete.*

Proof. Petri net coverability is known to be ExpSpace-complete [26]. We show hardness by reducing coverability of a marking m to R(Petri)-Reach. The idea is to construct a register machine with a Petri net as ADT. This register machine has two states $q_{init}$ and $q_{final}$. For every transition t of the original Petri net, we have $q_{init} \xrightarrow{t} q_{init}$ as a transition of the register machine (we simply simulate the original Petri net). Furthermore, we have $q_{init} \xrightarrow{t_{-m}} q_{final}$ as a transition of the register machine, where $t_{-m}$ consumes the marking m. Thus, the state $q_{final}$ can be reached iff m can be covered.

We reduce reachability of R(Petri) to Petri net coverability. We construct the Petri net by taking the ADT Petri and adding a place $p_q$ for every state q and a place $p_{reg,d}$ for every register reg ∈ Regs and register value d ∈ Dom. The idea is that a marking with a token in $p_q$ and one in $p_{reg,d}$, but none in $p_{reg,d'}$ for $d' \neq d$, corresponds to a configuration of R(Petri) with state q and reg = d. The value of Petri is given by the remainder of the marking.

We simulate any transition $q \xrightarrow{instr} q'$ with a transition t that takes one token from $p_q$ and puts one in $p_{q'}$. If instr ∈ Ops, then instr is a Petri net transition; we simply add the same input and output arcs to t. To simulate a write of a value d to register reg, we add a new transition $t_{d'}$ for every $d' \in Dom$, with an arc to $p_{reg,d}$ and an arc from $p_{reg,d'}$. The initial marking is consistent with $val_{init}^{Petri}$ and has one token in $p_{q_{init}}$. A state q is reachable iff a marking with one token in $p_q$ is coverable.
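The place-and-transition bookkeeping of this construction can be sketched as follows. The encoding of moves and the place-naming scheme are our own illustration, and the arcs for instructions in Ops (which copy the ADT's own arcs) are omitted:

```python
# Illustrative bookkeeping for the reduction; place names follow the text
# (p_q for states, p_reg,d for register values). Arcs for ADT operations in
# Ops, which copy the Petri net's own arcs, are omitted here.
def build_net(states, registers, domain, moves):
    """moves: list of (q, instr, q2) with instr = (kind, reg, value)."""
    places = {f"p_{q}" for q in states}
    places |= {f"p_{r},{d}" for r in registers for d in domain}
    transitions = []  # each transition: (consumed tokens, produced tokens)
    for q, (kind, reg, val), q2 in moves:
        if kind == "write":
            # One copy per possible current value d2 of reg: consume the
            # token in p_reg,d2 and produce one in p_reg,val.
            for d2 in domain:
                transitions.append(({f"p_{q}": 1, f"p_{reg},{d2}": 1},
                                    {f"p_{q2}": 1, f"p_{reg},{val}": 1}))
        elif kind == "read":
            # A read checks the register value without changing it.
            transitions.append(({f"p_{q}": 1, f"p_{reg},{val}": 1},
                                {f"p_{q2}": 1, f"p_{reg},{val}": 1}))
    return places, transitions

places, transitions = build_net(["q0", "q1"], ["r"], [0, 1],
                                [("q0", ("write", "r", 1), "q1")])
```

A single write over a two-value domain yields one transition per possible old value, matching the $t_{d'}$ family described above.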

Higher Order ADTs. Let the M(A)-Reach problem be the restriction of R(A)-Reach with no registers. The M(A)-Reach problem has been studied for many ADTs, such as higher-order counter and higher-order stack variations [34,25].

#### Theorem 6.


Proof. M(n-St)-Reach has been shown to be (n − 1)-ExpTime-complete [25]. We know that M(n-wCt)-Reach is (n − 2)-ExpTime-complete and M(n-Ct)-Reach is (n − 2)-ExpSpace-complete [34]. Since the reduction from M(A)-Reach to R(A)-Reach is trivial, any hardness result can be applied to TSO(A)-P-Reach immediately using Theorem 2. In order to reduce R(A)-Reach to M(A)-Reach, we encode register assignments into the states, which results in an exponential blow-up in the number of states. Then we apply Theorem 1 to obtain our upper bound.

**Theorem 7.** *TSO(n-OMSt)-P-Reach is 2-ETime-complete.*

Proof. We know that M(n-OMSt)-Reach is 2-ETime-complete [12], and we can apply Theorem 2 to get 2-ETime-hardness. According to Theorem 4.6 in [11], M(n-OMSt)-Reach is in $O(|M(A)|^{2^{dn}})$ for some constant $d \in \mathbb{N}$. We apply the exponential size reduction to R(n-OMSt)-Reach together with Theorem 1 and get that TSO(n-OMSt)-P-Reach is in $O((2^{|P|})^{2^{dn}}) = O(2^{|P| \cdot 2^{dn}})$, and thus also in $O(2^{2^{|P|} \cdot 2^{dn}}) = O(2^{2^{|P|+dn}})$. Thus, TSO(n-OMSt)-P-Reach is in 2-ETime.

We study well structured ADTs [29,9] as defined in [3]:

**Theorem 8.** *If the ADT A is well structured, then TSO(A)-P-Reach is decidable.*

A register machine for a well structured ADT A is equivalent to the composition of a well structured transition system (WSTS) modeling A and a finite transition system (and thus a WSTS) that models states and registers. According to [9], the composition is again a WSTS and reachability is decidable. The above theorem is then an immediate corollary of Theorem 1.

## 8 Conclusions and Future Work

In this paper, we have taken the first step to studying the complexity of parameterized verification under weak memory models when the processes manipulate unbounded data domains. Concretely, we have presented complexity results for parameterized concurrent programs running on the classical TSO memory model when the processes operate on an abstract data type. We reduce the problem to reachability for register machines enriched with the given abstract data type.

State reachability for finite automata with an ADT has been extensively studied for many ADTs [34,25]. We have shown in Theorem 6 that we can apply our framework to existing complexity results for this problem. This provides us with decidability and complexity results for the corresponding instances of TSO(A)-P-Reach. However, due to the exponential number of register assignments, the upper bound is exponentially larger than the lower bound. We aim to study these cases further and determine more refined parametric bounds.

A direction for future work is considering other memory models, such as the partial store ordering semantics, the release-acquire semantics, and the ARM semantics. It is also interesting to re-consider the problem under the assumption of having distinguished processes (so-called leader processes). Adding leaders is known to make the parameterized verification problem harder. The complexity/decidability of parameterized verification under TSO with a single leader is open, even when the processes are finite-state.

## References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## Verifying Learning-Based Robotic Navigation Systems

Guy Amir<sup>1,∗(B)</sup>, Davide Corsi<sup>2,∗</sup>, Raz Yerushalmi<sup>1,3</sup>, Luca Marzari<sup>2</sup>, David Harel<sup>3</sup>, Alessandro Farinelli<sup>2</sup>, and Guy Katz<sup>1</sup>

<sup>1</sup> The Hebrew University of Jerusalem, Jerusalem, Israel {guyam,guykatz}@cs.huji.ac.il
<sup>2</sup> University of Verona, Verona, Italy {davide.corsi,luca.marzari,alessandro.farinelli}@univr.it
<sup>3</sup> The Weizmann Institute of Science, Rehovot, Israel {raz.yerushalmi,david.harel}@weizmann.ac.il

Abstract. Deep reinforcement learning (DRL) has become a dominant deep-learning paradigm for tasks where complex policies are learned within reactive systems. Unfortunately, these policies are known to be susceptible to bugs. Despite significant progress in DNN verification, there has been little work demonstrating the use of modern verification tools on real-world, DRL-controlled systems. In this case study, we attempt to begin bridging this gap, and focus on the important task of mapless robotic navigation — a classic robotics problem, in which a robot, usually controlled by a DRL agent, needs to efficiently and safely navigate through an unknown arena towards a target. We demonstrate how modern verification engines can be used for effective model selection, i.e., selecting the best available policy for the robot in question from a pool of candidate policies. Specifically, we use verification to detect and rule out policies that may demonstrate suboptimal behavior, such as collisions and infinite loops. We also apply verification to identify models with overly conservative behavior, thus allowing users to choose superior policies, which might be better at finding shorter paths to a target. To validate our work, we conducted extensive experiments on an actual robot, and confirmed that the suboptimal policies detected by our method were indeed flawed. We also demonstrate the superiority of our verification-driven approach over state-of-the-art gradient attacks. Our work is the first to establish the usefulness of DNN verification in identifying and filtering out suboptimal DRL policies in real-world robots, and we believe that the methods presented here are applicable to a wide range of systems that incorporate deep-learning-based agents.

## 1 Introduction

In recent years, deep neural networks (DNN) have become extremely popular, due to achieving state-of-the-art results in a variety of fields — such as natural

<sup>[\*]</sup> Both authors contributed equally.

language processing [16], image recognition [51], autonomous driving [11], and more. The immense success of these DNN models is owed in part to their ability to train on a fixed set of training samples drawn from some distribution, and then generalize, i.e., correctly handle inputs that they had not encountered previously. Notably, deep reinforcement learning (DRL) [37] has recently become a dominant paradigm for training DNNs that implement control policies for complex systems that operate within rich environments. One domain in which DRL controllers have been especially successful is robotics, and specifically — robotic navigation, i.e., the complex task of efficiently navigating a robot through an arena, in order to safely reach a target [63, 68].

Unfortunately, despite the immense success of DNNs, they have been shown to suffer from various safety issues [31, 57]. For example, small perturbations to their inputs, which are either intentional or the result of noise, may cause DNNs to react in unexpected ways [45]. These inherent weaknesses, and others, are observed in almost every kind of neural network, and indicate a need for techniques that can supply formal guarantees regarding the safety of the DNN in question. These weaknesses have also been observed in DRL systems [6,21,34], showing that even state-of-the-art DRL models may err miserably.

To mitigate such safety issues, the verification community has recently developed a plethora of techniques and tools [8,10,19,24,28,29,31,35,39,40,64,66] for formally verifying that a DNN model is safe to deploy. Given a DNN, these methods usually check whether the DNN: (i) behaves according to a prescribed requirement for all possible inputs of interest; or (ii) violates the requirement, in which case the verification tool also provides a counterexample.

To date, despite the abundance of both DRL systems and DNN verification techniques, little work has been published on demonstrating the applicability and usefulness of verification techniques to real-world DRL systems. In this case study, we showcase the capabilities of DNN verification tools for analyzing DRL-based systems in the robotics domain — specifically, robotic navigation systems. To the best of our knowledge, this is the first attempt to demonstrate how off-the-shelf verification engines can be used to identify both unsafe and suboptimal DRL robotic controllers, which cannot be detected otherwise using existing, incomplete methods. Our approach leverages existing DNN verifiers that can reason about single and multiple invocations of DRL controllers, and this allows us to conduct a verification-based model selection process — through which we filter out models that could render the system unsafe.

In addition to model selection, we demonstrate how verification methods allow gaining better insights into the DRL training process, by comparing the outcomes of different training methods and assessing how the models improve over additional training iterations. We also compare our approach to gradient-based methods, and demonstrate the advantages of verification-based tools in this setting. We regard this as another step towards increasing the reliability and safety of DRL systems, which is one of the key challenges in modern machine learning [27]; and also as a step toward a more wholesome integration of verification techniques into the DRL development cycle.

In order to validate our experiments, we conducted an extensive evaluation on a real-world, physical robot. Our results demonstrate that policies classified as suboptimal by our approach indeed exhibited unwanted behavior. This evaluation highlights the practical nature of our work; and is summarized in a short video clip [4], which we strongly encourage the reader to watch. In addition, our code and benchmarks are available online [3].

The rest of the paper is organized as follows. Section 2 contains background on DNNs, DRLs, and robotic controlling systems. In Section 3 we present our DRL robotic controller case study, and then elaborate on the various properties that we considered in Section 4. In Section 5 we present our experimental results, and use them to compare our approach with competing methods. Related work appears in Section 6, and we conclude in Section 7.

## 2 Background

Deep Neural Networks. Deep neural networks (DNNs) [25] are computational, directed graphs consisting of multiple layers. By assigning values to the first layer of the graph and propagating them through the subsequent layers, the network computes either a label prediction (for a classification DNN) or a value (for a regression DNN), which is returned to the user. The values computed in each layer depend on values computed in previous layers, and also on the current layer's type. Common layer types include the weighted sum layer, in which each neuron is an affine transformation of the neurons from the preceding layer; as well as the popular rectified linear unit (ReLU) layer, where each node y computes the value y = ReLU(x) = max(0, x), based on a single node x from the preceding layer to which it is connected. The DRL systems that are the subject matter of this case study consist solely of weighted sum and ReLU layers, although the techniques mentioned are suitable for DNNs with additional layer types, as we discuss later.

Fig. 1 depicts a small example of a DNN. For input V<sup>1</sup> = [2, 3]<sup>T</sup> , the second (weighted sum) layer computes the values V<sup>2</sup> = [20, −7]<sup>T</sup> . In the third layer, the ReLU functions are applied, and the result is V<sup>3</sup> = [20, 0]<sup>T</sup> . Finally, the network's single output is computed as a weighted sum: V<sup>4</sup> = [40].

Fig. 1: A toy DNN.
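Since Fig. 1 is not reproduced here, the following sketch uses hypothetical weights, chosen only so that the forward pass reproduces the values quoted in the text; the actual weights of Fig. 1 may differ:

```python
# Hypothetical weights, chosen only so that the forward pass reproduces the
# values quoted in the text; the actual weights of Fig. 1 may differ.
W2 = [[4.0, 4.0], [-8.0, 3.0]]  # weighted-sum layer
W4 = [2.0, 1.0]                 # output layer

def forward(v1):
    v2 = [sum(w * x for w, x in zip(row, v1)) for row in W2]  # weighted sum
    v3 = [max(0.0, x) for x in v2]                            # ReLU, element-wise
    v4 = sum(w * x for w, x in zip(W4, v3))                   # single output
    return v2, v3, v4

v2, v3, v4 = forward([2.0, 3.0])
print(v2, v3, v4)  # [20.0, -7.0] [20.0, 0.0] 40.0
```

Note how the ReLU layer zeroes out the negative intermediate value −7 before the final weighted sum.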

Deep Reinforcement Learning. Deep reinforcement learning (DRL) [37] is a particular paradigm and setting for training DNNs. In DRL, an agent is trained to learn a policy π, which maps each possible environment state s (i.e., the current observation of the agent) to an action a. The policy can have different interpretations among various learning algorithms. For example, in some cases, π represents a probability distribution over the action space, while in others it encodes a function that estimates a desirability score over all the future actions from a state s.

During training, at each discrete time-step t ∈ {0, 1, 2, . . .}, a reward r<sub>t</sub> is presented to the agent, based on the action a<sub>t</sub> it performed at time-step t. Different DRL training algorithms leverage the reward in different ways, in order to optimize the DNN-agent's parameters during training. The general DNN architecture described above also characterizes DRL-trained DNNs; the uniqueness of the DRL paradigm lies in the training process, which is aimed at generating a DNN that computes a mapping π that maximizes the expected cumulative discounted reward $R_t = \mathbb{E}\left[\sum_t \gamma^t \cdot r_t\right]$. The discount factor, $\gamma \in [0, 1]$, is a hyperparameter that controls the influence that past decisions have on the total expected reward.
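The cumulative discounted reward can be computed directly from a reward sequence; a minimal sketch:

```python
def discounted_return(rewards, gamma):
    # R = sum over t of gamma^t * r_t
    return sum(gamma ** t * r for t, r in enumerate(rewards))
```

For instance, three unit rewards with γ = 0.5 yield 1 + 0.5 + 0.25 = 1.75; smaller γ makes distant rewards matter less.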

DRL training algorithms are typically divided into three categories [55]: value-based methods (e.g., DDQN [60]), policy-gradient methods (e.g., Reinforce [67]), and actor-critic methods (e.g., PPO [50]).

All of these approaches are commonly used in modern DRL; and each has its advantages and disadvantages. For example, the value-based methods typically require only small sets of examples to learn from, but are unable to learn policies for continuous spaces of ⟨state,action⟩ pairs. In contrast, the policy-gradient methods can learn continuous policies, but suffer from a low sample efficiency and large memory requirements. Actor-Critic algorithms attempt to combine the benefits of value-based and policy-gradient methods, but suffer from high instability, particularly in the early stages of training, when the value function learned by the critic is unreliable.

DNN Verification and DRL Verification. A DNN verification algorithm receives as input [31]: (i) a trained DNN N; (ii) a precondition P on the DNN's inputs, which limits their possible assignments to inputs of interest; and (iii) a postcondition Q on N's output, which usually encodes the negation of the behavior we would like N to exhibit on inputs that satisfy P. The verification algorithm then searches for a concrete input $x_0$ that satisfies $P(x_0) \wedge Q(N(x_0))$, and returns one of the following outputs: (i) SAT, along with a concrete input $x_0$ that satisfies the given constraints; or (ii) UNSAT, indicating that no such $x_0$ exists. When Q encodes the negation of the required property, a SAT result indicates that the property is violated (and the returned input $x_0$ triggers a bug), while an UNSAT result indicates that the property holds.

For example, suppose we wish to verify that the DNN in Fig. 1 always outputs a value strictly smaller than 7; i.e., that for any input $x = \langle v_1^1, v_1^2 \rangle$, it holds that $N(x) = v_4^1 < 7$. This is encoded as a verification query by choosing a precondition that does not restrict the input, i.e., $P = (\mathit{true})$, and by setting $Q = (v_4^1 \geq 7)$, which is the negation of our desired property. For this verification query, a sound verifier will return SAT, alongside a feasible counterexample such as $x = \langle 0, 2 \rangle$, which produces $v_4^1 = 22 \geq 7$. Hence, the property does not hold for this DNN.
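A real verifier encodes the network and Q symbolically (e.g., as constraints handed to an SMT or LP backend) and can also prove UNSAT. Purely as an illustration of what a SAT witness is, the following sketch searches a small input grid for a point satisfying Q, using hypothetical weights chosen to match the values quoted in the text:

```python
# Brute-force illustration of a SAT witness for P = (true), Q = (output >= 7).
# A real verifier reasons symbolically and can also prove UNSAT; a grid
# search cannot. Weights are hypothetical, chosen to match the quoted values.
W2 = [[4.0, 4.0], [-8.0, 3.0]]
W4 = [2.0, 1.0]

def network(v1):
    v2 = [sum(w * x for w, x in zip(row, v1)) for row in W2]
    v3 = [max(0.0, x) for x in v2]
    return sum(w * x for w, x in zip(W4, v3))

def find_witness():
    for a in range(-5, 6):
        for b in range(-5, 6):
            x = [float(a), float(b)]
            if network(x) >= 7.0:  # Q holds: x is a SAT witness
                return x
    return None  # no witness on this grid (NOT a proof of UNSAT)

witness = find_witness()
```

With these weights, the input ⟨0, 2⟩ mentioned in the text indeed evaluates to 22 and satisfies Q, alongside many other witnesses.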

To date, the DNN verification community has focused primarily on DNNs used for a single, non-reactive invocation [24,28,31,40,64]. Some work has been carried out on verifying DRL networks, which pose greater challenges: beyond the general scalability challenges of DNN verification, in DRL verification we must also take into account that agents typically interact with a reactive environment [6,9,15,21,30]. In particular, these agents are implemented with neural networks that are invoked multiple times, and the inputs of each invocation are usually affected by the outputs of the previous invocations. This fact aggravates the scalability limitations (because multiple invocations must be encoded in each query), and also makes the task of defining P and Q significantly more complex [6].

## 3 Case Study: Robotic Mapless Navigation

Robotis Turtlebot 3. In our case study, we focus on the Robotis Turtlebot 3 robot (Turtlebot, for short), depicted in Fig. 2. Given its relatively low cost and efficient sensor configuration, this robot is widely used in robotics research [7,46]. In particular, this robotic platform has the actuators required for moving and turning, as well as multiple lidar sensors for detecting obstacles. These sensors use laser beams to approximate the distance to the nearest object in their direction [65]. In our experiments, we used a configuration with seven lidar sensors, each with a maximal range of one meter. Adjacent sensors are 30° apart, thus allowing coverage of 180°. The images in Fig. 3 depict a simulation of the Turtlebot navigating through an arena, and highlight the lidar beams. See the full version of this paper [5] for additional details.
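The sensor layout described above determines the beam angles; a small sketch (the −90°..+90° orientation convention is our assumption, not stated in the text):

```python
# Beam angles implied by the text: seven beams, adjacent beams 30 degrees
# apart, covering 180 degrees. The -90..+90 orientation is our convention.
NUM_SENSORS, SPACING_DEG, MAX_RANGE_M = 7, 30, 1.0

angles = [-90 + SPACING_DEG * i for i in range(NUM_SENSORS)]
coverage = angles[-1] - angles[0]
print(angles, coverage)  # [-90, -60, -30, 0, 30, 60, 90] 180
```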

The Mapless Navigation Problem. Robotic navigation is the task of navigating a robot (in our case, the Turtlebot) through an arena. The robot's goal is to reach a target destination while adhering to predefned restrictions; e.g., selecting as short a path as possible, avoiding obstacles, or optimizing energy consumption. In recent years, robotic navigation tasks have received a great deal of attention [63,68], primarily due to their applicability to autonomous vehicles.

Fig. 2: The Robotis Turtlebot 3 platform, navigating in an arena. The image on the left depicts a static robot, and the image on the right depicts the robot moving towards the destination (the yellow square), while avoiding two wooden obstacles in its route.

We study here the popular mapless variant of the robotic navigation problem, where the robot can rely only on local observations (i.e., its sensors), without any information about the arena's structure or additional data from external sources. In this setting, which has been studied extensively [58], the robot has access to the relative location of the target, but does not have a complete map of the arena. This makes mapless navigation a partially observable problem, and among the most challenging tasks to solve in the robotics domain [13, 58, 70].

DRL-Controlled Mapless Navigation. State-of-the-art solutions to mapless navigation suggest training a DRL policy to control the robot. Such DRL-based solutions have obtained outstanding results from a performance point of view [47]. For example, recent work by Marchesini et al. [43] has demonstrated how DRL-based agents can be applied to control the Turtlebot in a mapless navigation setting, by training a DNN with a simple architecture, including two hidden layers. Following this recent work, in our case study we used the following topology for DRL policies:


<sup>1</sup> It has been shown that discrete controllers achieve excellent performance in robotic navigation, often outperforming continuous controllers in a large variety of tasks [43].

Fig. 3: An example of a simulated Turtlebot entering a 2-step loop. The white and red dashed lines represent the lidar beams (white indicates "clear", and red indicates that an obstacle is detected). The yellow square represents the target position; and the blue arrows indicate rotation. In the first row, from left to right, the Turtlebot is stuck in an infinite loop, alternating between right and left turns. Given the deterministic nature of the system, the agent will continue to select these same actions, ad infinitum. In the second row, from left to right, we present an almost identical configuration, but with an obstacle located 30° to the robot's left (circled in blue). The presence of the obstacle changes the input to the DNN, and allows the Turtlebot to avoid entering the infinite loop; instead, it successfully navigates to the target.

While the aforementioned DRL topology has been shown to be efficient for robotic navigation tasks, finding the optimal training algorithm and reward function is still an open problem. As part of our work, we trained multiple deterministic policies using the DRL algorithms presented in Section 2: DDQN [60], Reinforce [67], and PPO [50]. For the reward function, we used the following formulation:

$$
r_t = (d_{t-1} - d_t) \cdot \alpha - \beta,
$$

where d<sub>t</sub> is the distance from the target at time-step t; α is a normalization factor used to guarantee the stability of the gradient; and β is a fixed value, subtracted at each time-step, resulting in a total penalty proportional to the length of the path (by minimizing this penalty, the agent is encouraged to reach the target quickly). In our evaluation, we empirically selected α = 3 and β = 0.001. Additionally, we added a final reward of +1 when the robot reached the target, or −1 in case it collided with an obstacle. For additional information regarding the training phase, see the full version of this paper [5].
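The reward formulation can be written down directly; the helper name and the handling of the terminal ±1 bonus as extra arguments are our own sketch:

```python
ALPHA, BETA = 3, 0.001  # values selected empirically in the paper

def step_reward(d_prev, d_curr, reached=False, collided=False):
    # Positive when the robot moved closer to the target; the constant BETA
    # is subtracted every step, penalizing long paths.
    r = (d_prev - d_curr) * ALPHA - BETA
    if reached:   # terminal bonus
        r += 1.0
    if collided:  # terminal penalty
        r -= 1.0
    return r
```

For example, closing the distance from 1.0 m to 0.9 m in one step yields 0.1 · 3 − 0.001 = 0.299.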

DRL Training and Results. Using the training algorithms mentioned in Section 2, we trained a collection of DRL agents to solve the Turtlebot mapless navigation problem. We ran a stochastic training process, and thus obtained varied agents; of these, we only kept those that achieved a success rate of at least 96% during training. A total of 780 models were selected, consisting of 260 models for each of the three training algorithms. More specifically, for each

Fig. 4: (a) The DRL controller used for the robot in our case study. The DRL has nine input neurons: seven lidar sensor readings (blue), one input indicating the relative angle (orange) between the robot and the target, and one input indicating the distance (green) between the robot and the target. (b) The average success rates of models trained by each of the three DRL training algorithms, per training episode.

algorithm, all 260 models were generated from 52 random seeds. Each seed gave rise to a family of 5 models, where the individual family members differ in the number of training episodes used for training them. Fig. 4b shows the trained models' average success rate, for each algorithm used. We note that PPO was generally the fastest to achieve high accuracy. However, all three training algorithms successfully produced highly accurate agents.

## 4 Using Verification for Model Selection

All of our trained models achieved very high success rates, and so, at face value, there was no reason to favor one over the other. However, as we show next, a verification-based approach can expose multiple subtle differences between them. As our evaluation criteria, we define two properties of interest that are derived from the main goals of the robotic controller: (i) reaching the target; and (ii) avoiding collision with obstacles. Employing verification, we use these criteria to identify models that may fail to fulfill their goals, e.g., because they collide with various obstacles, are overly conservative, or may enter infinite loops without reaching the target. We now define the properties that we used, and the results of their verification are discussed in Section 5. Additional details regarding the precise encoding of our queries appear in the full version of this paper [5].

Collision Avoidance. Collision avoidance is a fundamental and ubiquitous safety property [14] for navigation agents. In the context of Turtlebot, our goal is to check whether there exists a setting in which the robot is facing an obstacle, and chooses to move forward — even though it has at least one other viable option, in the form of a direction in which it is not blocked. In such situations, it is clearly preferable to choose to turn LEFT or RIGHT instead of choosing to move FORWARD and collide. See Fig. 5 for an illustration.

Fig. 5: Example of a single-step collision. The robot is not blocked on its right and can avoid the obstacle by turning (panel A), but it still chooses to move forward — and collides (panel B).

Given that turning LEFT or RIGHT produces an in-place rotation (i.e., the robot does not change its position), the only action that can cause a collision is FORWARD. In particular, a collision can happen when an obstacle is directly in front of the robot, or is slightly off to one side (just outside the front lidar's field of detection). More formally, we consider the safety property "the robot does not collide at the next step", with three different types of collisions:


Recall that in mapless navigation, all observations are local — the robot has no sense of the global map, and can encounter any possible obstacle configuration (i.e., any possible sensor reading). Thus, in encoding these properties, we considered a single invocation of the DRL agent's DNN, with the following constraints:


The exact encoding of these properties is based on the physical characteristics of the robot and the lidar sensors, as explained in the full version of this paper [5].
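As a rough illustration of the property (not the actual verification query, which quantifies over all inputs satisfying the precondition), a single-invocation check might look like this; the safety distance and beam indexing are hypothetical:

```python
# Single-invocation illustration of the collision property; the real query
# quantifies over all sensor readings satisfying the precondition. The
# safety distance and beam indexing (index 3 = front beam) are hypothetical.
SAFE_DIST = 0.2  # meters, hypothetical threshold

def violates_collision_property(lidar, action):
    front_blocked = lidar[3] < SAFE_DIST
    side_clear = lidar[0] >= SAFE_DIST or lidar[6] >= SAFE_DIST
    # Violation: the robot moves FORWARD into an obstacle although at least
    # one turning direction is viable.
    return action == "FORWARD" and front_blocked and side_clear

v = violates_collision_property([1.0, 1.0, 1.0, 0.1, 1.0, 1.0, 1.0], "FORWARD")
```

Here the front beam reads 0.1 m while both sides are clear, so choosing FORWARD violates the property.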

Infinite Loops. Whereas collision avoidance is the natural safety property to verify in mapless navigation controllers, checking that progress is eventually made towards the target is the natural liveness property. Unfortunately, this property is difficult to formulate due to the absence of a complete map. Instead, we settle for a weaker property, and focus on verifying that the robot does not enter infinite loops (which would prevent it from ever reaching the target).

Unlike the case of collision avoidance, where a single step of the DRL agent could constitute a violation, here we need to reason about multiple consecutive invocations of the DRL controller, in order to identify infinite loops. This, again, is difficult to encode due to the absence of a global map, and so we focus on in-place loops: infinite sequences of steps in which the robot turns LEFT and RIGHT, but without ever moving FORWARD, thus maintaining its current location ad infinitum.

Our queries for identifying in-place loops encode that: (i) the robot does not reach the target in the first step; (ii) in the following k steps, the robot never moves FORWARD, i.e., it only performs turns; and (iii) the robot returns to an already-visited configuration, guaranteeing that the same behavior will be repeated by our deterministic agents. The various queries differ in the choice of k, as well as in the sequence of turns performed by the robot. Specifically, we encode queries for identifying the following kinds of loops:


We also note that all the loop-identification queries include a condition for ensuring that the robot is not blocked from all directions. Consequently, any loops that are discovered demonstrate a clearly suboptimal behavior.
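Under a simplified, hypothetical model of the robot (a 12-sector circular scan, 30° turns, 7 visible sectors), conditions (ii) and (iii) can be illustrated by a simulation-based check. This is a sketch only, not the paper's actual verification encoding, and condition (i) (not reaching the target in the first step) is omitted for brevity:

```python
# Illustrative sketch (not the paper's verifier encoding): check whether a
# deterministic policy turns in place without ever moving FORWARD, over a
# hypothetical 12-sector circular scan with 30-degree turns.

def visible(scan, orientation):
    """The 7 lidar readings the robot sees at a given orientation."""
    return tuple(scan[(orientation + i) % len(scan)] for i in range(7))

def has_in_place_loop(policy, scan, k=12):
    """True if the policy only turns for k steps and revisits a configuration."""
    orientation = 0
    seen = {orientation}
    for _ in range(k):
        action = policy(visible(scan, orientation))
        if action == "FORWARD":          # condition (ii) violated: progress is made
            return False
        orientation = (orientation + (1 if action == "RIGHT" else -1)) % len(scan)
        if orientation in seen:          # condition (iii): a visited configuration recurs
            return True
        seen.add(orientation)
    return False

scan = [1.0] * 12                        # no nearby obstacles in any direction
print(has_in_place_loop(lambda obs: "RIGHT", scan))    # a policy that always turns loops
print(has_in_place_loop(lambda obs: "FORWARD", scan))  # a policy that advances does not
```

An alternating LEFT/RIGHT policy revisits its starting orientation after two steps, matching the 2-step ALTERNATING LOOP queries; a policy that always turns the same way needs 12 steps, matching the 12-step full-cycle queries.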

Specific Behavior Profiles. In our experiments, we noticed that the safe policies, i.e., the ones that do not cause the robot to collide, displayed a wide spectrum of different behaviors when navigating to the target. These differences occurred not only between policies that were trained by different algorithms, but also between policies trained by the same reward strategy — indicating that these differences are, at least partially, due to the stochastic realization of the DRL training process.

Specifically, we noticed high variability in the length of the routes selected by the DRL policy in order to reach the given target: while some policies demonstrated short, efficient paths that passed very close to obstacles, other policies demonstrated a much more conservative behavior, selecting longer paths and avoiding getting close to obstacles (an example appears in Fig. 6).

Thus, we used our verification-driven approach to quantify how conservative the learned DRL agent is in the mapless navigation setting. Intuitively, a highly conservative policy will keep a significant safety margin from obstacles (possibly taking a longer route to reach its destination), whereas a "braver" and less conservative controller would risk venturing


Fig. 6: Comparing paths selected by policies with different bravery levels. Path A takes the Turtlebot close to the obstacle (red area), and is the shortest. Path B maintains a greater distance from the obstacle (light red area), and is consequently longer. Finally, path C maintains such a significant distance from the obstacle (white area) that it is unable to reach the target.

closer to obstacles. In the case of Turtlebot, the preferable DRL policies are the ones that guarantee the robot's safety (with respect to collision avoidance), and demonstrate a high level of bravery — as these policies tend to take shorter, optimized paths (see path A in Fig. 6), which lead to reduced energy consumption over the entire trail.

Bravery assessment is performed by encoding verification queries that identify situations in which the Turtlebot can move forward, but its control policy chooses not to. Specifically, we encode single invocations of the DRL model, in which we bound the lidar inputs to indicate that the Turtlebot is sufficiently distant from any obstacle and can safely move forward. We then use the verifier to determine whether, in this setting, a FORWARD output is possible. By adjusting the bounds on the central lidar sensor, we can control how far away the robot perceives the obstacle to be. If the policy still refuses to move FORWARD even when this distance is large, it is considered conservative; otherwise, it is considered brave. By conducting a binary search over these bounds [6], we can identify the shortest distance from an obstacle for which the policy safely orders the robot to move FORWARD. The inverse of this value then serves as a bravery score for that policy.
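The binary search itself can be sketched as follows. Here `forward_possible` is a hypothetical stand-in for the underlying verification query ("can the policy output FORWARD when the central lidar reads distance d?"), assumed monotone in the distance:

```python
# A minimal sketch of the bravery-score binary search over the obstacle
# distance reported by the central lidar sensor. The search range and
# precision follow the text; the predicate is a stub, not a real verifier.

def bravery_score(forward_possible, lo=0.18, hi=1.0, precision=0.01):
    """Find the smallest distance at which FORWARD is chosen; return its inverse."""
    while hi - lo > precision:
        mid = (lo + hi) / 2
        if forward_possible(mid):
            hi = mid        # FORWARD is possible here: try smaller distances
        else:
            lo = mid        # the policy refuses FORWARD: the threshold is larger
    return 1.0 / hi

# Stub standing in for the verifier: this policy moves forward from 0.5 m onward.
score = bravery_score(lambda d: d >= 0.5)   # threshold found near 0.5 m
```

A more conservative policy yields a larger threshold and hence a smaller score, which matches the ranking described above.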

Design-for-Verification: Sliding Windows. A significant challenge that we faced in encoding our verification properties, especially those that pertain to multiple consecutive invocations of the DRL policy, had to do with the local nature of the sensor readings that serve as input to the DNN. Specifically, if

the robot is in some initial configuration that leads to a sensor input x, and then chooses to move forward and reaches a successor configuration in which the sensor input is x′, some connection between x and x′ must be expressed as part of the verification query (i.e., nearby obstacles that exist in x cannot suddenly vanish in x′). In the absence of a global map, this is difficult to enforce.

In order to circumvent this difficulty, we used the sliding window principle, which has proven quite useful in similar settings [6, 21]. Intuitively, the idea is to focus on scenarios where the connections between x and x′ are particularly straightforward to encode — in fact, most of the sensor information that appeared in x also appears in x′. This approach allows us to encode multistep queries, and is also beneficial in terms of performance: typically, adding sliding-window constraints reduces the search space explored by the verifier, and expedites solving the query.

In the Turtlebot setting, this is achieved by selecting a robot configuration in which the angle between two neighboring lidar sensors is identical to the turning angle of the robot (in our case, 30°). This guarantees, for example, that if the central lidar sensor observes an obstacle at distance d and the robot chooses to turn RIGHT, then at the next step, the lidar sensor just to the left of the central sensor must detect the same obstacle, at the same distance d. More generally, if at time-step t the 7 lidar readings (from left to right) are ⟨l<sub>1</sub>, . . . , l<sub>7</sub>⟩ and the robot turns RIGHT, then at time-step t + 1 the 7 readings are ⟨l<sub>2</sub>, l<sub>3</sub>, . . . , l<sub>7</sub>, l<sub>8</sub>⟩, where only l<sub>8</sub> is a new reading. The case for a LEFT turn is symmetrical. By placing these constraints on consecutive states encountered by the robot, we were able to encode complex properties that involve multiple time-steps, e.g., as in the aforementioned infinite loops. An illustration appears in Fig. 3.
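The shift constraint can be stated as a small consistency predicate over consecutive 7-reading vectors. This is only a sketch: in the paper, the same relation is encoded as constraints inside the verification query, not as runtime code:

```python
# The sliding-window constraint as a predicate on two consecutive lidar
# reading vectors (7 readings each, ordered left to right).

def window_consistent(prev, curr, action):
    """Check that a turn shifts the readings by one sensor, admitting one new reading."""
    if action == "RIGHT":   # <l1..l7> becomes <l2..l7, l8>
        return curr[:6] == prev[1:]
    if action == "LEFT":    # symmetric: a new reading enters on the left
        return curr[1:] == prev[:6]
    return True             # no constraint imposed here for FORWARD

prev = (2.0, 1.8, 1.5, 1.2, 1.5, 1.8, 2.0)
print(window_consistent(prev, (1.8, 1.5, 1.2, 1.5, 1.8, 2.0, 0.7), "RIGHT"))  # True
print(window_consistent(prev, (1.8, 1.5, 1.2, 1.5, 1.8, 2.0, 0.7), "LEFT"))   # False
```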

## 5 Experimental Evaluation

Next, we ran verification queries with the aforementioned properties, in order to assess the quality of our trained DRL policies. The results are reported below. In many cases, we discovered configurations in which the policies would cause the robot to collide or enter infinite loops; we later validated the correctness of these results on a physical robot. We strongly encourage the reader to watch a short video clip that demonstrates some of these results [4]. Our code and benchmarks are also available online [3]. In our experiments, we used the Marabou verification engine [33] as our backend, although other engines could be used as well. For additional details regarding the experiments, we refer the reader to the full version of this paper [5].

Model Selection. In this set of experiments, we used verification to assess our trained models. Specifically, we used each of the three training algorithms (DDQN, Reinforce, PPO) to train 260 models, creating a total of 780 models. For each of these, we verified six properties of interest: three collision properties (FORWARD COLLISION, LEFT COLLISION, RIGHT COLLISION), and three loop properties (ALTERNATING LOOP, LEFT CYCLE, RIGHT CYCLE), as described in Section 4. This gives a total of 4680 verification queries. We ran all queries with a


Table 1: Results of the policy verification queries. We verified six properties over each of the 260 models trained per algorithm; SAT indicates that the property was violated, whereas UNSAT indicates that it held (to reduce clutter, we omit TIMEOUT and FAIL results). The rightmost column reports the stability values of the various training methods. For the full results see [3].

TIMEOUT value of 12 hours and a MEMOUT limit of 2 GB; the results are summarized in Table 1. The single-step collision queries usually terminated within seconds, and the 2-step queries encoding an ALTERNATING LOOP usually terminated within minutes. The 12-step cycle queries, which are more complex, usually ran for a few hours. 9.6% of all queries hit the TIMEOUT limit (all from the 12-step cycle category), and none of the queries hit the MEMOUT limit.<sup>2</sup>

Our results exposed various differences between the trained models. Specifically, of the 780 models checked, 752 (over 96%) violated at least one of the single-step collision properties. These 752 collision-prone models include all 260 DDQN-trained models, 256 Reinforce models, and 236 PPO models. Furthermore, when we conducted a model filtering process based on all six properties (three collisions and three infinite loops), we discovered that 778 models out of the total of 780 (over 99.7%!) violated at least one property. The only two models that passed our filtering process were trained by the PPO algorithm.

Further analyzing the results, we observed that PPO models tended to be safer to use than those trained by other algorithms: they usually had the fewest violations per property. However, there are cases in which PPO proved less successful. For example, our results indicate that PPO-trained models are more prone to enter an ALTERNATING LOOP than those trained by Reinforce. Specifically, 214 (82.3%) of the PPO models entered this undesired state, compared to 145 (55.8%) of the Reinforce models. We also point out that, as with the collision properties, all DDQN models violated this property.

Finally, when considering 12-step cycles (either LEFT CYCLE or RIGHT CYCLE), 44.8% of the DDQN models entered such cycles, compared to 30.7% of the Reinforce models, and just 12.4% of the PPO models. In computing these results, we

<sup>2</sup> We note that two queries failed due to internal errors in Marabou.

computed the fraction of violations (SAT queries) out of the number of queries that did not time out or fail, and aggregated SAT results for both cycle directions.
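The aggregation just described amounts to the following small computation, shown here with made-up outcome counts:

```python
# Sketch of the aggregation described in the text: the fraction of SAT
# (violated) queries out of those that neither timed out nor failed,
# pooling both cycle directions. The outcome list below is hypothetical.

def violation_fraction(outcomes):
    """outcomes: list of 'SAT', 'UNSAT', 'TIMEOUT', or 'FAIL' query results."""
    decided = [o for o in outcomes if o in ("SAT", "UNSAT")]
    return sum(o == "SAT" for o in decided) / len(decided)

# Hypothetical LEFT CYCLE and RIGHT CYCLE results for one algorithm, pooled.
outcomes = ["SAT"] * 31 + ["UNSAT"] * 69 + ["TIMEOUT"] * 10
print(violation_fraction(outcomes))  # 0.31
```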

Interestingly, in some cases, we observed a bias toward violating a certain subcase of various properties. For example, in the case of entering full cycles: although 125 (out of 520) queries indicated that Reinforce-trained agents may enter a cycle in either direction, in 96% of these violations the agent entered a RIGHT CYCLE. This bias is not present in models trained by the other algorithms, where the violations are roughly evenly divided between cycles in both directions.

We find that our results demonstrate that different "black-box" algorithms generalize very differently with respect to various properties. In our setting, PPO produces the safest models, while DDQN tends to produce models with a higher number of violations. We note that this does not necessarily indicate that PPO-trained models perform better, but rather that they are more robust to corner cases. Using our filtering mechanism, it is possible to select the safest models among the available, seemingly equivalent candidates.

Next, we used verification to compute the bravery score of the various models. Using a binary search, we computed for each model the minimal distance a dead-ahead obstacle needs to have for the robot to safely move forward. The search range was [0.18, 1] meters, and the optimal values were computed up to a precision of 0.01 (see the full version of this paper [5] for additional details). Almost all binary searches terminated within minutes, and none hit the TIMEOUT threshold.

By first filtering the models based on their safe behavior, and then by their bravery scores, we are able to find the few models that are both safe (do not collide) and not overly conservative. These models tend to take efficient paths, and may come close to an obstacle, but without colliding with it. We also point out that over-conservativeness may significantly reduce the success rate in specific scenarios, such as cases in which the obstacle is close to the target. Specifically, of the only two models that survived the first filtering stage, one is considerably more conservative than the other — requiring the obstacle to be twice as distant as the braver model requires before moving forward.

Algorithm Stability Analysis. As part of our experiments, we used our method to assess the three training algorithms — DDQN, PPO, and Reinforce. Recall that we used each algorithm to train 52 families of 5 models each, in which the models from the same family are generated from the same random seed, but with a different number of training iterations. While all models obtained a high success rate, we wanted to check how often it occurred that a model successfully learned to satisfy a desirable property after some training iterations, only to forget it after additional iterations. Specifically, we focused on the 12-step full-cycle properties (LEFT CYCLE and RIGHT CYCLE), and for each family of 5 models checked whether some models satisfied the property while others did not.

We define a family of models to be unstable in the case where a property holds for some model in the family, but ceases to hold for another model from the same family with a higher number of training iterations. Intuitively, this means that the model "forgot" a desirable property as training progressed. The instability value of each algorithm type is defined to be the number of unstable 5-member families.
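This definition can be sketched as follows; the per-family property-satisfaction values below are hypothetical:

```python
# Sketch of the instability count: a family (models ordered by increasing
# training iterations) is unstable if a property holds for some model but
# fails for a later one.

def is_unstable(family):
    """family: list of booleans, True = property holds, ordered by iterations."""
    return any(held and not holds_later
               for i, held in enumerate(family)
               for holds_later in family[i + 1:])

def instability(families):
    """Number of unstable families, i.e., the algorithm's instability value."""
    return sum(is_unstable(f) for f in families)

print(is_unstable([False, True, True, False, True]))   # forgot the property: unstable
print(is_unstable([False, False, True, True, True]))   # monotone improvement: stable
```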

Although all three algorithms produced highly accurate models, they displayed significant differences in the stability of their produced policies, as can be seen in the rightmost column of Table 1. Recall that we trained 52 families of models using each algorithm, and then tested their stability with respect to two properties (corresponding to the two full-cycle types). Of these, the DDQN models display 21 unstable alternations — more than twice the number of alternations demonstrated by the Reinforce models (10), and significantly higher than the number of alternations observed among the PPO models (1).

These results shed light on the nature of these training algorithms — indicating that DDQN is a significantly less stable training algorithm compared to PPO and Reinforce. This is in line with previous observations in non-verification-related research [50], and is not surprising, as the primary objective of PPO is to limit the changes the optimizer performs between consecutive training iterations.

Gradient-Based Methods. We also conducted a thorough comparison between our verification-based approach and competing gradient-based methods. Although gradient-based attacks are extremely scalable, our results (summarized in [5]) show that they may miss many of the violations found by our complete, verification-based procedure. For example, when searching for collisions, our approach discovered a total of 2126 SAT results, while the gradient-based method discovered only 1421 SAT results — a 33% decrease (!). In addition, given that gradient-based methods are unable to return UNSAT, they are also incapable of proving that a property always holds, and hence cannot formally guarantee the safety of a policy in question. Thus, performing model selection based on gradient-based methods could lead to skewed results. We refer the reader to the full version of this paper [5], in which we elaborate on gradient attacks and the experiments we ran, demonstrating the advantages of our approach for model selection when compared to gradient-based methods.

## 6 Related Work

Due to the increasing popularity of DNNs, the formal methods community has put forward a plethora of tools and approaches for verifying DNN correctness [20,24,26,28,31–33,36,39,52,59]. Recently, the verification of systems involving multiple DNN invocations, as well as hybrid systems with DNN components, has been receiving significant attention [6, 9, 17, 18, 22, 34, 54, 61]. Our work here is another step toward applying DNN verification techniques to additional, real-world systems and properties of interest.

In the robotics domain, multiple approaches exist for increasing the reliability of learning-based systems [48,62,69]; however, these methods are mostly heuristic in nature [1,23,42]. To date, existing techniques rely mostly on Lagrangian multipliers [38,49,53], and do not provide formal safety guarantees; rather, they optimize the training in an attempt to learn the required policies [12]. Other, more formal approaches focus solely on the systems' input-output relations [15, 41], without considering multiple invocations of the agent and its interactions with the environment. Thus, existing methods are not able to provide rigorous guarantees regarding the correctness of multistep robotic systems, and do not take into account sequential decision making — which renders them insufficient for detecting various safety and liveness violations.

Our approach is orthogonal and complementary to many existing safe DRL techniques. Reward reshaping and shielding techniques (e.g., [2]) improve safety by altering the training loop, but typically afford no formal guarantees. Our approach can be used to complement them, by selecting the most suitable policy from a pool of candidates, post-training. Guard rules and runtime shields are beneficial for preventing undesirable behavior of a DNN agent, but are sometimes less suited for specifying the desired actions it should take instead. In contrast, our approach allows selecting the optimal policy from a pool of candidates, without altering its decision-making.

## 7 Conclusion

Through the case study described in this paper, we demonstrate that current verification technology is applicable to real-world systems. We show this by applying verification techniques for improving the navigation of DRL-based robotic systems. We demonstrate how off-the-shelf verification engines can be used to conduct effective model selection, as well as gain insights into the stability of state-of-the-art training algorithms. As far as we are aware, ours is the first work to demonstrate the use of formal verification techniques on multistep properties of actual, real-world robotic navigation platforms. We also believe the techniques developed here will allow the use of verification to improve additional multistep systems (autonomous vehicles, surgery-aiding robots, etc.), in which we can impose a transition function between subsequent steps. However, our approach is limited by DNN-verification technology, which we use as a black-box backend. As that technology becomes more scalable, so will our approach. Moving forward, we plan to generalize our work to richer environments — such as cases where a memory-enhanced agent interacts with moving objects, or even with multiple agents in the same arena — as well as running additional experiments with deeper networks and more complex DRL systems. In addition, we see probabilistic verification of stochastic policies as interesting future work.

Acknowledgements. The work of Amir, Yerushalmi and Katz was partially supported by the Israel Science Foundation (grant number 683/18). The work of Amir was supported by a scholarship from the Clore Israel Foundation. The work of Corsi, Marzari, and Farinelli was partially supported by the "Dipartimenti di Eccellenza 2018-2022" project, funded by the Italian Ministry of Education, Universities, and Research (MIUR). The work of Yerushalmi and Harel was partially supported by a research grant from the Estate of Harry Levine, the Estate of Avraham Rothstein, Brenda Gruss and Daniel Hirsch, the One8 Foundation, Rina Mayer, Maurice Levy, and the Estate of Bernice Bernath, grant 3698/21 from the ISF-NSFC (joint to the Israel Science Foundation and the National Science Foundation of China), and a grant from the Minerva foundation. We thank Idan Refaeli for his contribution to this project.

## References


work for Verification and Analysis of Deep Neural Networks. In Proc. 31st Int. Conf. on Computer Aided Verification (CAV), pages 443–452, 2019.


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## Make Flows Small Again: Revisiting the Flow Framework

Roland Meyer<sup>1</sup>, Thomas Wies<sup>2</sup>, and Sebastian Wolff<sup>2</sup>

<sup>1</sup> TU Braunschweig, Braunschweig, Germany, roland.meyer@tu-bs.de <sup>2</sup> New York University, New York, USA, {wies,sebastian.wolff}@cs.nyu.edu

Abstract. We present a new flow framework for separation logic reasoning about programs that manipulate general graphs. The framework overcomes problems in earlier developments: it is based on standard fixed point theory, guarantees least flows, rules out vanishing flows, and has an easy-to-understand notion of footprint as needed for soundness of the frame rule. In addition, we present algorithms for automating the frame rule, which we evaluate on graph updates extracted from linearizability proofs for concurrent data structures. The evaluation demonstrates that our algorithms help to automate key aspects of these proofs that have previously relied on user guidance or heuristics.

Keywords: Separation Logic · Graph Algorithms · Frame Inference.

## 1 Introduction

The flow framework [23, 24] is an abstraction mechanism based on separation logic [5, 32, 40] that enables reasoning about global inductive invariants of general graphs in a local manner. The framework has proved useful to verify intricate algorithms that are difficult to handle by other techniques, such as the Priority Inheritance Protocol, object-oriented design patterns, and complex concurrent data structures [22,24,27,34]. However, these efforts have also exposed some rough corners in the underlying meta theory that either limit expressivity or automation. In this paper, we propose a new meta theory for the flow framework that aims to strike a balance between these conflicting requirements. In addition, we present algorithms that aid proof automation.

Background. The central notion of the flow framework is that of a *flow*. Given a commutative monoid (M, +, 0) (e.g. natural numbers with addition), and a graph with nodes X and an *edge function* E : X × X → M → M, a flow is a function fl : X → M that satisfies the *flow equation*:

$$\forall x \in X. \quad fl(x) = in\_x + \sum\_{y \in X} E\_{(y,x)}(fl(y))\ .$$

That is, fl is a fixed point of the function that assigns every node x an initial value in<sub>x</sub> ∈ M, its *inflow*, and then propagates these values through the graph according to the edge function. This is akin to a forward data flow analysis where the monoid operation + is used as the join. By choosing an appropriate flow monoid, inflow, and edge function, one can express inductive properties of graphs (reachability, sortedness, etc.) in terms of conditions that refer only to each node's flow value fl(x).
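For monotone, continuous edge functions, the least flow can be computed by iterating the fixed-point functional from the bottom element. A minimal sketch over the monoid of naturals with addition, on a made-up diamond graph with identity edge functions (so fl counts paths from the inflow source):

```python
# Kleene iteration for the least solution of the flow equation
#   fl(x) = in_x + sum over y of E_(y,x)(fl(y)),
# sketched for the flow monoid of natural numbers with addition.

def least_flow(nodes, edges, inflow, rounds=100):
    """edges: dict (y, x) -> continuous function M -> M; inflow: dict node -> M."""
    fl = {x: 0 for x in nodes}               # start from the bottom element
    for _ in range(rounds):
        new = {x: inflow.get(x, 0) + sum(f(fl[y]) for (y, z), f in edges.items() if z == x)
               for x in nodes}
        if new == fl:                        # fixed point reached: this is the least flow
            return fl
        fl = new
    raise RuntimeError("no fixed point within the given number of rounds")

ident = lambda m: m                          # the edge label "lambda id" from Fig. 1a
# A diamond: r -> u, r -> v, u -> w, v -> w; inflow 1 enters at r.
nodes = {"r", "u", "v", "w"}
edges = {("r", "u"): ident, ("r", "v"): ident, ("u", "w"): ident, ("v", "w"): ident}
print(least_flow(nodes, edges, {"r": 1}))    # w gets flow 2: two paths from r
```

The graph here is an illustrative assumption, not the one from Fig. 1a; on cyclic graphs the same iteration computes the least fixed point whenever it converges.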

A graph endowed with an inflow and associated flow is a *flow graph*. An example flow graph h is shown on the right-hand side of Fig. 1a. Here, the flow value fl(w) for


© The Author(s) 2023

S. Sankaranarayanan and N. Sharygina (Eds.): TACAS 2023, LNCS 13993, pp. 628–646, 2023. https://doi.org/10.1007/978-3-031-30823-9\_32

Figure 1. (a) Two flow graphs h<sub>1</sub> with nodes h<sub>1</sub>.X = { x, y, z } (left) and h<sub>2</sub> with nodes h<sub>2</sub>.X = { r, u, v } (center) for the flow monoid of natural numbers with addition. The edge label λ<sub>id</sub> stands for the identity function. Omitted edges are labeled by the constant 0 function. Dashed edges represent the inflows. Nodes are labeled by their flow, respectively, outflow. The right side shows the composition h = h<sub>1</sub> ∗ h<sub>2</sub>. (b) Two flow graphs h<sub>1</sub> with h<sub>1</sub>.X = { u, x } (top) and h<sub>2</sub> with h<sub>2</sub>.X = { v, w } (bottom) whose composition is undefined due to vanishing flows.

a node w counts the number of paths from r to w. A flow graph can be partial and have edges to nodes outside of X, like the node u for h<sub>1</sub> in Fig. 1a. If we include these nodes in the computation of the flow, then their flow values constitute the *outflow* of the flow graph. For instance, the outflow of h<sub>1</sub> for u is 1.

Flow graphs are equipped with a notion of disjoint composition, h = h<sub>1</sub> ∗ h<sub>2</sub>. An example is given in Fig. 1a. The composition is only defined if the union of the flows of h<sub>1</sub> and h<sub>2</sub> is again a flow of h. This may not always be the case. For instance, the inflows and outflows of h<sub>1</sub> and h<sub>2</sub> may be mutually incompatible, such as h<sub>1</sub> sending outflow 2 to u whereas the inflow to u in h<sub>2</sub> is only 1.

Flow graph composition yields a *separation algebra*. That is, if we use flow graphs as an abstraction of program states (e.g., the heap), then we can use separation logic to reason locally about properties of programs that are expressed in terms of the induced flow graphs. For example, suppose the program updates the flow graph h in Fig. 1a to a new flow graph h′ by inserting a new edge labeled λ<sub>id</sub> between the nodes r and u. This increases the flow of u and v from 1 to 2. We can break this update down as follows. First, we decompose h into h<sub>1</sub> and h<sub>2</sub>. Next, we obtain h′<sub>2</sub> from h<sub>2</sub> by inserting the edge and updating the flow of u and v to 2. Finally, we compose h′<sub>2</sub> again with h<sub>1</sub> to obtain h′. Note that the composition h<sub>1</sub> ∗ h′<sub>2</sub> is still defined. This means that any property expressed over the flow in the h<sub>1</sub>-portion of h still holds in h′. This is the well-known *frame rule* of separation logic, instantiated for flow graphs.

The crux in applying the frame rule is to show that the composition h<sub>1</sub> ∗ h′<sub>2</sub> is indeed defined. One can do this locally by showing that the update h<sub>2</sub> ⇝ h′<sub>2</sub> is *frame-preserving*, i.e., for *any* h<sub>1</sub> such that h<sub>1</sub> ∗ h<sub>2</sub> is defined, h<sub>1</sub> ∗ h′<sub>2</sub> is also defined.

Typically, the flow subgraphs involved in a frame-preserving update h<sub>2</sub> ⇝ h′<sub>2</sub> include more nodes than those immediately affected by the update. For instance, consider the subgraphs of h and h′ in our example that consist only of the nodes {r, u} directly affected by inserting the edge. These subgraphs do not constitute a frame-preserving update because inserting the edge between r and u also changes the outflow to v from 1 to 2. Hence, the updated subgraph for {r, u} would no longer compose with the rest of h, where v's flow is still 1 instead of 2. We refer to a set of nodes such as {r, u, v} that identifies a frame-preserving update as the update's *footprint*.

Meta theories of flow graphs. In addition to ensuring that flow graph composition yields a separation algebra, there are two desiderata that one has to take into consideration when designing a meta theory of flow graphs:


The first subgoal is crucial for expressivity and the second one for proof automation. Achieving one subgoal makes it more difficult to achieve the other. Specifically, consider the meta theory proposed in [24]. It requires that the flow monoid (M, +, 0) is also cancellative (m + n<sub>1</sub> = o and m + n<sub>2</sub> = o implies n<sub>1</sub> = n<sub>2</sub>). Requiring cancellativity has the advantage that it is easy to check if an update h ⇝ h′ is frame-preserving: it suffices to show that h and h′ have the same inflow and outflow. Cancellativity also ensures that for each flow fl, there exists a unique inflow that produces fl. Hence, it is sufficient to track only fl, since the inflow is a derived quantity. However, the converse does not hold.

In fact, obtaining unique flows for cancellative M becomes more difficult. A natural requirement that one would like to impose on M is that the pre-order induced by + forms a complete partial order (cpo) or even a complete lattice. This way, one can focus on the least flow, which is guaranteed to exist if one applies standard fixed point theorems, imposing only mild assumptions on the edge functions. However, cancellativity is inherently incompatible with standard domain-theoretic prerequisites. For instance, the only ordered cancellative commutative monoid that is a directed cpo is the trivial one: M<sub>0</sub> = {0}. Similarly, M<sub>0</sub> is the only such monoid that has a greatest element.

For cases where unique flows are desired, [24] imposes additional requirements on the edge functions (nilpotent) or the graph structure (effectively acyclic). The former is quite restrictive in terms of expressivity. The latter again complicates the computation of frame-preserving updates: one now has to ensure that no cycles are introduced when the updated graph h′<sub>2</sub> is composed with its frame h<sub>1</sub>. In fact, for the effectively acyclic case, [24] only provides a sufficient condition that a given footprint yields a frame-preserving update, but it gives no algorithm for computing such a footprint.

Contributions. In this paper, we propose a new meta theory of flows based on flow monoids that form ω-cpos (but need not be cancellative). The cpo requirement yields the desired least fixed point semantics. The differences in the requirements on the flow monoid necessitate a new notion of flow graph composition. In particular, for a least fixed point semantics of flows, h = h<sub>1</sub> ∗ h<sub>2</sub> is only defined if the flows of h<sub>1</sub> and h<sub>2</sub> do not vanish. An example of such a situation is shown in Fig. 1b, where the flows in h<sub>1</sub> and h<sub>2</sub> would vanish to 0 in h<sub>1</sub> ∗ h<sub>2</sub> because the created cycle has no external inflow. Moreover, an update h ⇝ h′ is frame-preserving if h and h′ route inflows to outflows in the same way. We formalize this condition using a notion of contextual equivalence of the graphs' *transfer functions*, which are the least fixed points of the flow equation, parameterized by the inflows and restricted to the nodes outside the graphs. We then identify conditions on the edge functions that are commonly satisfied in practice and that allow us to effectively check contextual equivalence of transfer functions. This result is remarkable because the flow monoid can have infinite ascending chains and the flow graphs can be cyclic. Building on this equivalence check, we propose an iterative algorithm for computing footprints of updates. This algorithm enables the automation of the frame rule for reasoning about programs manipulating flow graphs. We evaluate the presented algorithms on a benchmark suite of flow graph updates that are extracted from linearizability proofs for concurrent search structures constructed by the tool plankton [26,27]. The evaluation demonstrates that our algorithms help to automate key aspects of these proofs that have previously relied on user guidance or heuristics.

## 2 Flow Graph Separation Algebra

We start with the presentation of our new separation algebra of flow graphs.

Given a commutative monoid (M, +, 0), we define the binary relation ≤ on M by n ≤ m if there is o ∈ M with m = n + o. Flow values are drawn from a *flow monoid*, a commutative monoid for which the relation ≤ is an ω-cpo. That is, ≤ is a partial order and every ascending chain K = m<sub>0</sub> ≤ m<sub>1</sub> ≤ ... in M has a least upper bound, denoted ⊔K. We expect n + ⊔K = ⊔(n + K). In the following, we fix a flow monoid (M, +, 0).

Let ContFun(M → M) be the continuous functions in M → M. Recall that a function f : M → M is *continuous* [43] if it commutes with limits of ascending chains, f(⊔K) = ⊔f(K) for every chain K in M. We lift + and ≤ to functions M → M in the expected way. An empty iterated sum Σ<sub>i∈∅</sub> m<sub>i</sub> is defined to be 0.

Lemma 1. (ContFun(M → M), ◦, id) *is a monoid. Moreover, if* (M, ≤) *is an* ω*-cpo, so is* (ContFun(M → M), ≤)*.*

A *flow graph* is a tuple h = (X, E, in) consisting of a finite set of nodes X ⊆ N, a set of edges E : X × N → ContFun(M → M) labeled by continuous functions, and an *inflow* in : (N \ X) × X → M. We use FG for the set of all flow graphs and denote the empty flow graph by h<sub>∅</sub> ≜ (∅, ∅, ∅).

We define two derived functions for flow graphs. First, the *flow* is the least function flow : X → M satisfying the flow equation: flow(x) = in_x + rhs_x(flow), for all x ∈ X. Here, in_x ≜ ∑_{y∈N\X} in(y, x) is a monoid value and rhs_x ≜ ∑_{y∈X} E(y,x) is a function of type ContFun((X → M) → M). Finally, we also define the *outflow* out : X × (N \ X) → M by out(x, y) ≜ E(x,y)(flow(x)).
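To make the least fixed point concrete, the following Python sketch computes the flow by Kleene iteration for a powerset flow monoid (union as addition, as in Example 1 below). The graph encoding and helper names are our own, chosen for illustration; termination of the loop here relies on the finite value domain:

```python
# Kleene iteration for the flow equation flow(x) = in_x + rhs_x(flow),
# sketched for a powerset flow monoid (sets of keys, union as addition).

def compute_flow(nodes, edges, inflow):
    """nodes: set of node ids; edges: dict (y, x) -> edge function M -> M
    for the edge from y to x; inflow: dict node -> aggregated value in_x.
    Returns the least flow as a dict node -> monoid value."""
    flow = {x: frozenset() for x in nodes}        # bottom: 0 is the empty set
    while True:
        new = {}
        for x in nodes:
            val = inflow.get(x, frozenset())      # in_x
            for y in nodes:                       # rhs_x(flow)
                f = edges.get((y, x))
                if f is not None:
                    val = val | f(flow[y])        # + is union
            new[x] = val
        if new == flow:                           # fixed point reached
            return flow
        flow = new

# Edge label lambda_k from Example 1 below: remove all keys <= k.
def lam(k):
    return lambda m: frozenset(v for v in m if v > k)
```

For a two-node chain l → r with edge label λ₆ and inflow {1,...,10} at l, the iteration converges to flow(r) = {7,...,10}.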

*Example 1.* For linearizability proofs of concurrent search structures one can use a flow that labels every data structure node x with its *inset*, the set of keys k such that a thread searching for k may traverse the node x [22,23]. Translated to our setting, the relevant flow monoid is the powerset of keys, P(Z ∪ { −∞, ∞ }), with set union as addition. Figure 2 shows two keyset flow graphs that abstract potential states of a concurrent set implementation based on sorted linked lists. When a key k is removed from the set, the node x that stores k is first marked to indicate that x has been logically deleted. In

Figure 2. Two flow graphs h₁ (left) and h₂ (right) with h₁.X = h₂.X = { l, t, r } for the keyset flow monoid P(Z ∪ { −∞, ∞ }). The edge label λ_k for a key k denotes the function λm. m \ [−∞, k].

a second step, x is then physically unlinked from the list. The idea of the abstraction is that an edge leaving a node x that stores a key k is labeled by the function λ_k if x is unmarked and otherwise by λ₋∞. This is because a search for a key k′ ∈ Z will traverse the edge leaving x iff k < k′ or x is marked. In the figure, l and r are assumed to be unmarked, storing keys 6 and 8, respectively. Node t is assumed to be marked. Flow graph h₂ is obtained from h₁ by physically unlinking the marked node t. Using the keyset flow, one can then express the crucial data structure invariants that are needed for a linearizability proof based on local reasoning (e.g., the invariant that the logical contents of a node are always a subset of its inset).

We note that the inflow of the global flow graph that abstracts the program state can be used in the specification. In the example, one lets in_r = Z for the root r of the data structure and in_x = ∅ for all other nodes to indicate that all searches start at r.

Composition without vanishing flows. To define the composition of flow graphs, h₁ ∗ h₂, we proceed in two steps. We first define an auxiliary composition that may suffer from *vanishing* flows, local flows that disappear in the composition. That is, this composition is defined for the flow graphs shown in Fig. 1b. In the composed graph, the flow of each node is 0 where it was 1 before the composition; the flow vanishes. This means that the auxiliary composition does not allow lifting lower bounds on the flow values from the individual components to the composed graph. Hence, the actual composition restricts the auxiliary composition to rule out such vanishing flows. Definedness of the auxiliary composition requires disjointness of the nodes in h₁ and h₂. Moreover, the outflow of one flow graph has to match the inflow expectations of the other:

$$\begin{aligned} h_1 \mathrel{\#\#} h_2 \quad \text{if} \quad X_1 \cap X_2 = \emptyset \;\wedge\; \forall x \in X_1,\ y \in X_2.\ & \mathit{out}_1(x, y) = \mathit{in}_2(x, y) \,\wedge \\ & \mathit{out}_2(y, x) = \mathit{in}_1(y, x)\,. \end{aligned}$$

The auxiliary composition h₁ ⊎ h₂ removes the inflow provided by the other component:

$$h_1 \uplus h_2 \;\triangleq\; \bigl(\, X_1 \uplus X_2,\; E_1 \uplus E_2,\; (\mathit{in}_1 \uplus \mathit{in}_2)|_{(\mathbb{N} \setminus (X_1 \uplus X_2)) \times (X_1 \uplus X_2)} \,\bigr)\,.$$

To rule out vanishing flows, we incorporate a suitable equality on the flows:

$$h_1 \mathrel{\#} h_2 \quad \text{if} \quad h_1 \mathrel{\#\#} h_2 \;\wedge\; h_1.flow \uplus h_2.flow = (h_1 \uplus h_2).flow\,.$$

Only if the latter equality holds do we have the composition h₁ ∗ h₂ ≜ h₁ ⊎ h₂. It is worth noting that h₁.flow ⊎ h₂.flow ≥ (h₁ ⊎ h₂).flow always holds. What definedness really asks for is the reverse inequality.
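The definedness condition for the auxiliary composition can be sketched directly. The following Python fragment (our own encoding; field names and the dict-based graph representation are illustrative, not the paper's) checks node disjointness and the matching of outflows against inflow expectations:

```python
# Definedness check for the auxiliary composition h1 ## h2 (sketch).
# A flow graph is modeled as a dict: h['X'] is the node set, and
# h['in'], h['out'] map pairs (src, dst) to monoid values; out(x, y)
# is assumed precomputed from the flow as E(x,y)(flow(x)).

def aux_composable(h1, h2, zero=frozenset()):
    """Return True iff h1 ## h2 holds; `zero` is the monoid's 0 element."""
    if h1['X'] & h2['X']:                 # nodes must be disjoint
        return False
    for x in h1['X']:
        for y in h2['X']:
            # outflow of one graph must match the inflow expected by the other
            if h1['out'].get((x, y), zero) != h2['in'].get((x, y), zero):
                return False
            if h2['out'].get((y, x), zero) != h1['in'].get((y, x), zero):
                return False
    return True
```

Ruling out vanishing flows additionally requires recomputing the flow of the composed graph and comparing it with the component flows, which this fragment deliberately omits.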

Recall from [5] that a *separation algebra* is a partial commutative monoid (Σ, ∗, emp) with a set of units emp ⊆ Σ.

Lemma 2. (FG, ∗, { h∅ }) *is a separation algebra.*

## 3 Frame-Preserving Updates

Since flow graphs form a separation algebra, we can use separation logic assertions to describe sets of flow graphs as in [24] and then use them to prove separation logic Hoare triples. A key proof rule used in such proofs is the frame rule. Given separation logic assertions P₁ and P₂ and a command c, the frame rule states: if the Hoare triple {P₁} c {P₂} is valid, then so is {P₁ ∗ F} c {P₂ ∗ F} for any *frame* F. The remainder of the paper focuses on developing algorithms for automating this proof rule.

The flow graphs described by an assertion may have unbounded size (e.g., due to the use of *iterated separating conjunctions*). We only consider bounded flow graphs in the following; the unbounded case is known to be a challenge for which orthogonal techniques are being developed (cf. Sect. 6). However, even if the flow graphs have bounded size, there may still be infinitely many of them because the inflows and edge functions are encoded symbolically in a logical theory of the flow monoid. For pedagogy, we present our algorithms in terms of concrete flow graphs rather than symbolic ones. However, our development readily extends to symbolic representations, assuming the underlying flow monoid theory is decidable. In fact, our implementation discussed in Sect. 5 works with symbolic flow graphs.

The soundness of the frame rule relies on the assumption that the state update induced by the command c satisfies a certain locality condition. In our setting, this condition amounts to checking that the update of P₁ under c is *frame-preserving* with respect to flow graph composition. For the flow graphs h₁ described by P₁ and all flow graphs h₂ in the post image of h₁ under c, this means that h₁ # h implies h₂ # h for all h. Intuitively, h₂ # h still holds if h₁ and h₂ transfer inflows to outflows in the same way.

Formally, for a flow graph h we define its *transfer function* tf(h) mapping inflows to outflows, tf(h) : ((N \ X) × X → M) → X × (N \ X) → M, by

$$tf(h)(in') \triangleq h[in \mapsto in'].out\ .$$

For a given inflow in, we also write tf(h₁) =_in tf(h₂) to mean that for all inflows in′ ≤ in, tf(h₁)(in′) = tf(h₂)(in′).

Definition 1. *Flow graphs* h₁, h₂ *are* contextually equivalent*, denoted* h₁ =ctx h₂*, if we have* h₁.X = h₂.X*,* h₁.in = h₂.in*, and* tf(h₁) =_{h₁.in} tf(h₂)*.*

Theorem 1 (Frame Preservation). *For all flow graphs* h₁ =ctx h₂ *and* h*,* h₁ # h *if and only if* h₂ # h *and, in case of definedness,* h₁ ∗ h =ctx h₂ ∗ h*.*

To automate the frame rule for a command c and a precondition P, we need to identify a decomposition P = P₁ ∗ F so as to infer {P₁} c {P₂} and then apply the frame rule to derive {P} c {Q} for the postcondition Q = P₂ ∗ F. This is closely related to the *frame inference problem* [4]. When a command modifies a flow graph h₁ to h₂, our goal is to identify a (hopefully small) set of nodes Y in h₁ that are affected by this update, the *flow footprint*. That is, Y captures the difference between the flow graphs before and after the update, and the complement of Y defines the frame. To make this formal, we need the restriction of flow graphs to subsets of nodes, which then gives us a notion of flow graph decomposition. Towards this, consider h and Y ⊆ N. We define

$$h|_Y \;\triangleq\; \bigl(\, h.X \cap Y,\; h.E|_{(h.X \cap Y) \times \mathbb{N}},\; in \,\bigr)$$

such that the inflow in satisfies in(z, y) ≜ h.in(z, y) for all z ∈ N \ h.X, y ∈ h.X ∩ Y and in(x, y) ≜ h.E(x,y)(h.flow(x)) for all x ∈ h.X \ Y, y ∈ h.X ∩ Y.

Definition 2. *Consider* h₁ *and* h₂ *with* X ≜ h₁.X = h₂.X *and* h₁.in = h₂.in*. A* flow footprint *for the difference between* h₁ *and* h₂ *is a subset of nodes* Y ⊆ X *so that* h₁|_Y =ctx h₂|_Y *and* h₁|_{X\Y} = h₂|_{X\Y}*. The set of all such footprints is* FFP(h₁, h₂)*.*

Flow graphs over different sets of nodes or inflows never have a flow footprint. The former requirement merely simplifies the presentation. To that end, we assume that all nodes that will be allocated during program execution are already present in the initial flow graph. This assumption can be lifted. The latter requirement is motivated by the fact that the global inflow is part of the specification, as noted earlier in Example 1.

Before we proceed with the problem of how to compute flow footprints, we highlight some of their properties.

Lemma 3 (Footprint Monotonicity). *If* Z ∈ FFP(h₁, h₂) *and* Z ⊆ Y ⊆ h₁.X*, then* Y ∈ FFP(h₁, h₂)*.*

A consequence of monotonicity is the existence of a canonical flow footprint: if there is a flow footprint at all, then the set of all nodes will work as a footprint. Of course, this canonical footprint is undesirably large. It corresponds to the case where one reasons about flow graph updates globally, forgoing the application of the frame rule. Unfortunately, an inclusion-minimal flow footprint need not exist.

Proposition 1 (Canonical Footprints). *We have:* FFP(h₁, h₂) ≠ ∅ *if and only if* h₁.X ∈ FFP(h₁, h₂)*. There need not be an inclusion-minimal flow footprint; in particular, the set* FFP(h₁, h₂) *is not closed under intersection.*

The proof of monotonicity requires a better understanding of the restriction operator, as provided by the following lemma.

Lemma 4 (Restriction). *Consider* h *and* Y, Z ⊆ N*. Then (i)* h|_Y.flow = h.flow|_Y*, (ii)* h|_Y # h|_{X\Y} *and* h|_Y ∗ h|_{X\Y} = h*, and (iii)* (h|_Y)|_Z = h|_{Y∩Z}*.*

Since flow footprints are defined via restriction, the lemma also shows that flow footprints are well-behaved. For example, the restriction to the footprint Y does not change the flow of a node y ∈ Y nor that of a node x ∈ h.X \ Y. More formally, this means h|_Y.flow(y) = h.flow(y) and h|_{X\Y}.flow(x) = h.flow(x), by Lemma 4(i).

For our development, it will be convenient to have a more operational formulation of the transfer function. Towards this, we understand the flow graph as a function that takes an inflow as a parameter and yields a transformer of flow approximants:

$$h: \left( (\mathbb{N} \backslash X) \times X \to \mathbb{M} \right) \to (X \to \mathbb{M}) \to X \to \mathbb{M}$$

$$\text{defined by } \qquad h[in](\sigma)(x) = in\_x + rhs\_x(\sigma) \,.$$

Recall in_x ≜ ∑_{y∈N\X} in(y, x) and rhs_x(σ) = ∑_{y∈X} E(y,x)(σ(y)). The least fixed point of h[in] is ⊔_{i∈N} h[in]^i(⊥), with h[in]⁰ = id_{X→M} and h[in]^{i+1} = h[in]^i ∘ h[in], by Kleene's theorem. Define out : (X → M) → X × (N \ X) → M by out(σ)(y, z) ≜ E(y,z)(σ(y)). This yields the following characterization of transfer functions and flows.

Lemma 5 (Transfer). *For all flow graphs* h *we have (i)* tf(h) = out ∘ (lfp.h[−]) *and (ii)* lfp.h[h.in] = h.flow*.*

## 4 Computing Footprints

We present an algorithm for computing a footprint for the difference between two given flow graphs. We proceed in two steps. We first give a high-level description of the algorithm that ignores computability problems. In a second step, we show how to solve the computability problems. Throughout the development, we will assume to have flow graphs h₁ and h₂ over the same nodes X ≜ h₁.X = h₂.X and with the same inflow h₁.in = h₂.in. If this assumption fails, a flow footprint does not exist by definition.

#### 4.1 Algorithm

We compute the flow footprint as a fixed point. We start with the footprint candidate Z consisting of the nodes whose outgoing edges differ in h₁ and h₂. Then, we iteratively add the nodes whose outflow leaving the current footprint candidate Z differs in h₁|_Z and h₂|_Z. That the outflow differs means that the transfer functions tf(h₁|_Z) and tf(h₂|_Z) differ and thus the candidate Z is not a footprint. In turn, if all outflows match, the transfer functions coincide and Z is a footprint as desired.

Technically, we compute the fixed point over the powerset lattice of nodes endowed with a distinguished top element: (P(X)^⊤, ⊆) with P(X)^⊤ ≜ P(X) ∪ {⊤}. Element ⊤ indicates a failure of the footprint computation. This may arise if the footprint is not covered by X, i.e., extends beyond the flow graphs h₁, h₂.

Our fixed point computation starts from Z = odif_{h₁,h₂} ⊆ X as defined by

$$\operatorname{odif}_{h_1, h_2} \;\triangleq\; \{\, x \in X \mid \exists z \in \mathbb{N}.\; h_1.E(x, z) \neq h_2.E(x, z) \,\}\,.$$

The fixed point then proceeds to extend Z as long as the transfer functions associated with h₁|_Z and h₂|_Z do not match. To define the extension, we let the *transfer failure* of Z ⊆ X be the successor nodes of Z that may receive different outflow from h₁ and h₂:

$$\operatorname{tfail}_{h_1, h_2}(Z) \;\triangleq\; \left\{\, x \in \mathbb{N} \setminus Z \;\middle|\; \begin{array}{l} \exists\, in' \leq h_1|_Z.in.\ \exists\, z \in Z. \\ \left[\operatorname{tf}(h_1|_Z)(in')\right](z, x) \neq \left[\operatorname{tf}(h_2|_Z)(in')\right](z, x) \end{array} \right\}.$$

This set is the *reason* why the current footprint candidate Z is not a footprint, that is, Z ∉ FFP(h₁, h₂). Extending Z with the transfer failure yields a new candidate. We check that the new candidate is covered by X (i.e., does not include nodes outside of h₁, h₂). If the check fails, the new candidate is ⊤ to indicate that no footprint could be computed. The following definition makes the extension procedure precise.

Definition 3. *The function* ext_{h₁,h₂} : P(X)^⊤ → P(X)^⊤ *is defined by* ext_{h₁,h₂}(Z) ≜ tfail_{h₁,h₂}(Z) ⊆ X ? Z ∪ odif_{h₁,h₂} ∪ tfail_{h₁,h₂}(Z) : ⊤.

Iteratively extending the candidate Z with the transfer failure eventually produces a footprint for the difference of h₁ and h₂, or fails with ⊤. The approach is sound.

Theorem 2 (Soundness). *Let* F ≜ lfp.ext_{h₁,h₂}*. If* F ≠ ⊤*, then* F ∈ FFP(h₁, h₂)*.*

Figure 3. Computing a footprint for the difference of h and h′ iterates through the sets Z₀ ≜ { r }, Z₁ ≜ { r, u }, and Z₂ ≜ { r, u, v }. The latter is the least fixed point of ext_{h,h′} and a footprint as desired, Z₂ ∈ FFP(h, h′).

*Example 2.* For an illustration consider Fig. 3. There, we apply the fixed point computation to find a footprint for the difference of h and h′. As alluded to in Sect. 1, h′ is the result of inserting into h a new edge between nodes r and u labeled with λid.

The fixed point computation starts from Z₀ ≜ { r } = odif_{h,h′}, as r is the only node whose outgoing edges have changed. Next, we compute tfail_{h,h′}(Z₀). This yields { u } because u receives 0 from Z₀ in h but 1 in h′ due to the new edge. The outflow from Z₀ to the remaining nodes coincides in h and h′. Hence, the extension of Z₀ with the transfer failure yields Z₁ ≜ ext_{h,h′}(Z₀) = { r, u }. Similarly, we compute tfail_{h,h′}(Z₁) and obtain Z₂ ≜ ext_{h,h′}(Z₁) = { r, u, v }. Since v has no outgoing edges, Z₂ is the least fixed point of ext_{h,h′}. Because Z₂ is a subset of the nodes of h and h′, it is a footprint, Z₂ ∈ FFP(h, h′).
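The iteration of Example 2 follows a simple skeleton. The Python sketch below (our own rendering) treats tfail as an oracle, since computing the transfer failure is exactly the difficulty addressed in Sect. 4.2:

```python
# Skeleton of the footprint fixed-point computation (sketch).
# tfail is supplied as an oracle mapping a candidate Z to the set of
# successor nodes that may receive different outflow from h1 and h2.

TOP = object()  # distinguished top element: footprint computation failed

def compute_footprint(X, odif, tfail):
    """X: all nodes of h1, h2; odif: nodes whose outgoing edges differ;
    tfail: function Z -> set of nodes. Returns a footprint or TOP."""
    Z = set(odif)                    # start from odif_{h1,h2}
    while True:
        fail = tfail(Z)
        if not fail <= X:            # footprint extends beyond h1, h2
            return TOP
        new_z = Z | fail             # ext_{h1,h2}(Z)
        if new_z == Z:               # fixed point: Z is a footprint
            return Z
        Z = new_z
```

Replaying Example 2 with a tabulated oracle (Z₀ = {r} fails at u, Z₁ = {r, u} fails at v, Z₂ = {r, u, v} has no failure) returns the footprint {r, u, v}.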

To obtain Theorem 2, we have to prove that the fixed point F ≜ lfp.ext_{h₁,h₂} is indeed a footprint if F ≠ ⊤. That is, we have to establish the following two properties according to Definition 2: (i) h₁|_F =ctx h₂|_F and (ii) h₁|_{X\F} = h₂|_{X\F}.

To see the latter, note that the graph structures (the nodes and edges) of h₁|_{X\F} and h₂|_{X\F} coincide because odif_{h₁,h₂} ⊆ F. The inflows coincide as well because they are, intuitively, comprised of the flow graph's overall inflow h₁.in = h₂.in and the outflow of the footprint, which is equal in both flow graphs due to h₁|_F =ctx h₂|_F.

The interesting part of the soundness proof is to establish property (i), the contextual equivalence h₁|_F =ctx h₂|_F. Since F is a fixed point of ext_{h₁,h₂}, we know that tfail_{h₁,h₂}(F) = ∅ and thus the transfer functions of h₁|_F and h₂|_F coincide. Hence, it suffices to establish h₁|_F.in = h₂|_F.in to obtain the desired contextual equivalence, Definition 1. This key step in the proof is obtained with the help of the following lemma.

Lemma 6. *Let* odif_{h₁,h₂} ⊆ F ⊆ X *with* tfail_{h₁,h₂}(F) = ∅*. Then* h₁|_F.in = h₂|_F.in*.*

To establish the lemma, one has to show that the inflow into F from the non-footprint part Y ≜ X \ F coincides in h₁ and h₂. The challenge is a cyclic dependency in the flow: the inflow from Y depends on the outflow of F, which depends on the inflow from Y. To tackle this, we rephrase the flow equation for h_i as a pairing of the two separate flow equations for h_i|_F and h_i|_Y, for i ∈ { 1, 2 }. Intuitively, the pairings compute the flow locally in h_i|_F and h_i|_Y for a fixed inflow (initially h_i.in). Then, the inflow to h_i|_F

Figure 4. Counterexample to completeness using the monoid (N ∪ {∞}, max, 0). While the set { x, y, z, u } is a footprint for the difference between flow graphs h₁ and h₂, our fixed point will produce the candidates { x } and Z ≜ { x, y, z } and then fail with ⊤.

is updated to the inflow from outside h_i and the inflow from h_i|_Y, and similarly for the inflow to h_i|_Y. This is repeated until a fixed point is reached. Technically, we rely on Bekić's Lemma [1] to compute the pairings. Then, we observe tf(h₁|_F) = tf(h₂|_F) because tfail_{h₁,h₂}(F) = ∅, as well as tf(h₁|_Y) = tf(h₂|_Y) because odif_{h₁,h₂} ⊆ F. Roughly, this means that the flow pairings for h₁ and h₂ must coincide as the individual parts propagate the same values. Put differently, the updated inflows for h₁|_F and h₂|_F as well as h₁|_Y and h₂|_Y coincide in each iteration. Overall, we get h₁|_F.in = h₂|_F.in.

Our computation of a flow footprint is forward: it starts from the nodes where the flow graphs differ and follows the edges. It may therefore fail if predecessor nodes of an iterate Z need to be considered to determine a flow footprint. For an example, refer to Fig. 4. Using the monoid (N ∪ {∞}, max, 0), it is easy to see that the set { x, y, z, u } is a footprint for the difference between h₁ and h₂. Our fixed point, however, will start with { x } and extend this to Z ≜ { x, y, z }. Let v be the node outside the flow graphs that y is pointing to. Then, the next transfer failure is tfail_{h₁,h₂}(Z) = { v } because for inflows in′ < k the outflow of y to v differs in h₁|_Z and h₂|_Z. Our approach fails to compute a footprint.

Fact 3 (Incompleteness). *There are flow graphs* h₁ *and* h₂ *for which our algorithm is not able to determine a flow footprint although one exists.*

#### 4.2 Comparing Transfer Functions

When implementing the above fixed point computation, the challenge is to prove the equivalence of given transfer functions in order to obtain the transfer failure: [tf(h₁|_Z)(−)](−, x) = [tf(h₂|_Z)(−)](−, x)? Already the comparison of two functions is known to be difficult to do algorithmically. What adds to the problem is that transfer functions are defined as least fixed points, meaning we do not have a closed-form representation of the functions to compare.

Our approach is to impose additional requirements on the set of edge functions. The requirements are met in all our experiments, and so do not limit the applicability of our approach. We show that if the edge functions are not only continuous but also distributive, then the transfer functions can be understood in terms of paths through the underlying flow graphs. If the edge functions are additionally decreasing and the underlying monoid's addition is idempotent, then acyclic paths suffice. Neither result holds for merely continuous edge functions.

Distributivity. Our first additional assumption is that the edge functions f : M → M are not only continuous, but also *distributive* in that f(m + n) = f(m) + f(n) for all m, n ∈ M and f(0) = 0. We use DistFun(M) to refer to the set of all continuous and distributive functions over M. The properties formulated in Lemma 1 carry over.

For continuous and distributive edge functions, we can understand h[in]^i in terms of the paths through h of length i. For example, i = 3 yields

$$\begin{aligned} [h[in]^3(\bot)](z) &= in_z + \sum_{y \in X} E_{(y,z)}\Bigl(\, in_y + \sum_{x \in X} E_{(x,y)}\bigl(\, in_x + \sum_{u \in X} E_{(u,x)}(\bot(u)) \,\bigr) \Bigr) \\ &= in_z + \sum_{y \in X} E_{(y,z)}(\, in_y) + \sum_{y \in X} \sum_{x \in X} E_{(y,z)}(E_{(x,y)}(\, in_x))\,. \end{aligned}$$

The first equality is by definition; the second is where distributivity comes in. In particular, ⊥(u) = 0 and so E_{(y,z)}(E_{(x,y)}(E_{(u,x)}(⊥(u)))) = 0. The last term shows that we forward the inflow given at a node x to an intermediary node y and from there to the node z of interest. For higher powers of h[in], we take longer paths. For h[in]^∗, we thus obtain the sum over all nodes x and all paths from x to z through the flow graph. We need some definitions to make this precise.

A *path* p through a flow graph h is a finite, non-empty sequence of nodes, all of which belong to the flow graph except the last, which lies outside:

$$p \;=\; x_0 \cdot \ldots \cdot x_n \cdot z \;\in\; X^+ \cdot (\mathbb{N} \setminus X)$$

where · denotes path concatenation. We use first(p) = x₀ resp. last(p) = x_n to extract the first resp. last node from within the flow graph h. By Paths(h, x, y, z) we denote the set of all paths through flow graph h that start in node first(p) = x and leave h from node last(p) = y to move to z ∈ N \ X. Given a set of nodes X′ ⊆ X, we use Paths(h, X′, y, z) for the union over all x ∈ X′ of the sets Paths(h, x, y, z). A path p induces the function E_p : M → M that composes the edge functions along the path:

$$E_z \triangleq id \qquad\qquad E_{x \cdot p} \triangleq E_p \circ E_{(x,\, first(p))}\,.$$

Together with Lemma 5, the above analysis yields the first closed-form representation of a flow graph's transfer function, which so far has involved a fixed point computation.

#### Theorem 4 (Closed-Form Representation). *If* h *is labeled over* DistFun(M)*, then:*

$$[tf(h)(in)](y, z) \;=\; \sum_{x \in X} \;\sum_{p \in Paths(h, x, y, z)} E_p(in_x)\,.$$

Theorem 4 pushes the fixed point computation of transfer functions into the sets Paths(h, x, y, z), which are themselves defined inductively and potentially infinite. In the following, we alleviate this problem without requiring acyclicity of the flow graph.

Idempotence. Our second assumption is that addition in the monoid is idempotent, meaning m + m = m for all m ∈ M. Idempotence ensures that addition degenerates to a join for comparable elements: m + n = m ⊔ n = n for all m ≤ n ∈ M. Unless stated otherwise, we hereafter assume an idempotent addition.

With Theorem 4, it remains to compare sums over paths. With idempotence, we show that we can further reduce the problem and reason over single paths rather than sums. We show that every path in h₁ can be replaced by a set of paths in h₂, and vice versa. Even more, we only have to consider the paths from nodes where the edges changed. The precise formulation of the path replacement condition is the following.

Definition 4. *The* path replacement condition *for flow graphs* h₁ *by* h₂ *over the same set of nodes* X *and labeled by* DistDecFun(M) *requires that for every* x ∈ odif_{h₁,h₂}*, for every* y ∈ X*, and for every* z ∈ N \ X *we have*

$$\forall p \in Paths(h_1, x, y, z).\; \exists P \subseteq Paths(h_2, x, y, z).\quad E_p \;\leq\; E_P \triangleq \sum_{q \in P} E_q\,.$$

*Example 3.* For the flow graphs h₁ and h₂ from Fig. 4, we have path replacement of h₁ by h₂, and vice versa. To see this, consider the path p ≜ x · z · u · y · v in h₁ and q ≜ x · y · v in h₂, where v is the node outside of h₁, h₂ that y points to. Since all edges are labeled with λid, we have E_p = λid = E_q. It is worth noting that, in this example, we can ignore the cycles in h₁ and h₂. In a moment, we will introduce restrictions on edge functions that allow us to avoid cycles in general.

Similarly, we have path replacement for the flow graphs from Fig. 2. To be precise, E_p = λ₈ = E_q for the paths p ≜ l · t · r · v in h₁ and q ≜ l · r · v in h₂.

The main result is that path replacement is sound and complete for proving equivalence of transfer functions.

Theorem 5 (Path Replacement Principle). *We have* tf(h₁) = tf(h₂) *if and only if path replacement of* h₁ *by* h₂ *and of* h₂ *by* h₁ *hold.*

The theorem is remarkable in several respects. First, one would expect that we have to replace the paths from all nodes in h₁. Instead, we can focus on the nodes where the outgoing edges changed. Second, one would expect the replacing paths P to start from arbitrary nodes in h₂. Such a set of paths would yield a transfer function of type (Y → M) → M. Instead, we can work with a function of type M → M. Even more, we can focus on paths starting in the same node as the path we intend to replace. Finally, the paths we use for replacement come without any constraints, leaving room for heuristics.

The proof starts from a *full path replacement condition* of h₁ by h₂, both over X and labeled by DistFun(M). Full path replacement coincides with Definition 4 but draws x from the full set X rather than from odif_{h₁,h₂}. Full path replacement characterizes equivalence of the transfer functions in a monoid with idempotent addition in the case of continuous and distributive edge functions.

#### Lemma 7. *Full path replacement of* h₁ *by* h₂ *and of* h₂ *by* h₁ *hold iff* tf(h₁) = tf(h₂)*.*

The result is a consequence of Theorem 4, which equates tf(h₁) with the sum of the E_p for all paths p ∈ Paths(h₁, x, y, z) for all x ∈ X. Full path replacement allows us to sum over E_P instead, for some P ⊆ Paths(h₂, x, y, z). Over-approximating P with all paths Paths(h₂, x, y, z), we obtain an upper bound for tf(h₁). It is easy to see that the resulting sum can be rewritten into the form of Theorem 4, yielding tf(h₁) ≤ tf(h₂). Analogously, we get tf(h₁) ≥ tf(h₂) and thus tf(h₁) = tf(h₂) as required. The reverse direction of the lemma is similar.

To conclude the proof of the path replacement principle in Theorem 5, we show that full path replacement and (ordinary) path replacement of h₁ by h₂ coincide. To see this, consider a path p ∈ Paths(h₁, x, y, z) for any x ∈ X. The goal is to show E_p ≤ E_P for some P ⊆ Paths(h₂, x, y, z). To that end, decompose the path into p = p₁ · p₂ such that x′ ≜ first(p₂) is the first node in p from odif_{h₁,h₂}. Ordinary path replacement yields Q ⊆ Paths(h₂, x′, y, z) with E_{p₂} ≤ E_Q. Now, choose P ≜ { p₁ · q | q ∈ Q }. Because p₁ exists in h₁ and h₂ with the exact same edge labels, we obtain the desired E_p ≤ E_P.

Lemma 8. *Full path replacement of* h<sup>1</sup> *by* h<sup>2</sup> *holds if and only if path replacement of* h<sup>1</sup> *by* h<sup>2</sup> *holds.*

Decreasingness. We assume that the edge functions f : M → M are not only continuous and distributive, but also *decreasing*: f(m) ≤ m for all m ∈ M. The assumption of decreasing edge functions is justified by the fact that a program that traverses the flow graph builds up information about the status of the structure, and smaller flow values mean more information (as in classical data flow analysis). We use DistDecFun(M) to refer to the set of all continuous, distributive, and decreasing edge functions over M; Lemma 1 carries over to this set. Addition in the monoid is still assumed idempotent.

If all edge functions are decreasing, every cycle in the flow graph is decreasing as well. The key observation is that, given an idempotent addition, cycles with decreasing edge functions can be avoided when forming sums over sets of paths.

Lemma 9. *Let* h *be labeled over* DistDecFun(M) *and* p₁ · p · p₂ ∈ Paths(h, x, y, z) *with* first(p) = first(p₂)*. Then* p₁ · p₂ ∈ Paths(h, x, y, z) *and* E_{p₁·p·p₂} ≤ E_{p₁·p₂}*.*

Call a path *simple* if it does not repeat a node, and let SimplePaths(h, x, y, z) denote the set of all simple paths through h from x to y that leave the flow graph towards z. Note that a finite graph admits only finitely many simple paths.
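Because there are only finitely many simple paths, the closed form of Theorem 4 becomes directly computable once Paths is replaced by SimplePaths. The following Python sketch (our own encoding) enumerates SimplePaths(h, x, y, z) and composes edge functions along a path as in the definition of E_p:

```python
# Enumerating simple paths x ... y z through a flow graph and composing
# the edge functions along each path (sketch; graph encoding is our own).

def simple_paths(X, edges, x, y, z):
    """Yield node sequences that start at x, stay simple inside X,
    reach y, and leave the graph via the edge (y, z) with z outside X."""
    def walk(path):
        last = path[-1]
        if last == y and (y, z) in edges:
            yield path + [z]
        for nxt in X:
            if nxt not in path and (last, nxt) in edges:
                yield from walk(path + [nxt])
    yield from walk([x])

def path_function(edges, path):
    """E_p: compose the edge functions along the path, first edge first."""
    f = lambda m: m                                  # base case: identity
    for a, b in zip(path, path[1:]):
        f = (lambda g, prev: lambda m: g(prev(m)))(edges[(a, b)], f)
    return f

# Keyset edge label lambda_k from Example 1: strip all keys <= k.
def lam(k):
    return lambda m: frozenset(v for v in m if v > k)
```

On a chain a → b → c with labels λ₃ and λ₅ (c outside the graph), the single simple path a · b · c induces the composition λ₅ ∘ λ₃, i.e., it strips all keys ≤ 5.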

Theorem 6 (Simple Paths). *Assuming continuous, distributive, and decreasing edge functions, and assuming idempotent addition, Theorem 4 and Theorem 5 hold with every occurrence of* Paths(h, x, y, z) *replaced by* SimplePaths(h, x, y, z)*.*

In practice, path-counting flows, keyset flows, reachability flows, shortest-path flows, and priority inheritance flows are relevant [22–24, 27] and compatible with our theory.

## 5 Evaluation

We substantiate the practicality of our new approach by evaluating it on a real-world collection of flow graphs extracted from the literature. We explain how we obtained our benchmarks and how we implemented and evaluated our approach.

Benchmark Suite. As alluded to in Sect. 1, the flow framework has been used to verify complex concurrent data structures. More specifically, it has been used for automated proof construction by the plankton tool [26, 27]. plankton performs an exhaustive proof search over a separation logic with support for flows, along with further advanced features for establishing linearizability that do not matter for the present evaluation. In order to handle heap updates, plankton generates a footprint h for the flow graph h₁ = h ∗ h_frame of the current proof state (represented as an assertion in separation logic). It then frames off the non-footprint part h_frame of the flow graph h₁ to compute the post state h′ of the heap update locally for the footprint h. The result is the new flow graph h₂ = h′ ∗ h_frame. We consider the pair (h₁, h₂) a *benchmark* for our evaluation.

We adapt plankton to export the flow graph pairs for which a footprint is constructed. This way, we obtain 1272 benchmarks from the heap updates occurring during proof construction for a collection of 10 concurrent set data structures. All flow graphs in this benchmark suite contain at most 4 nodes.

Our benchmark suite is limited by the capabilities and restrictions of plankton. In particular, we inherit the confinement to concurrent search structures. This is due to the fact that plankton integrates support only for the keyset flow (cf. Example 1). Our evaluation will compute footprints with respect to this flow.

Implementation. We implement the fixed point computation from Sect. 4 to find footprints for two given flow graphs h<sub>1</sub>, h<sub>2</sub> in a tool called krill [28]. It integrates three methods for computing the transfer failure tfail<sub>h₁,h₂</sub>(Z) of a footprint candidate Z:


Our benchmark suite satisfies the requirements for all three methods. The NAIVE and DIST methods include a (sufficient) check to ensure acyclicity in the updated flow graph to guarantee soundness of the resulting footprint.
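To make the shape of the fixed point computation concrete, here is a hypothetical sketch (not krill's actual code): a candidate footprint Z is grown by the nodes witnessing its transfer failure until the failure set becomes empty. The names `tfail`, `all_nodes`, and `updated` are our own placeholders.

```python
def find_footprint(all_nodes, tfail, updated):
    """Fixed-point search for a footprint candidate Z (illustrative sketch).

    tfail(Z) returns the set of nodes witnessing that Z is not yet a valid
    footprint; we grow Z by those nodes until tfail(Z) is empty, or give up
    when the candidate cannot be enlarged inside all_nodes.
    """
    Z = set(updated)
    while True:
        failed = set(tfail(Z))
        if not failed:
            return Z  # Z is a valid footprint
        grow = failed - Z
        if not grow or not grow <= set(all_nodes):
            return None  # candidate cannot be repaired within the graph
        Z |= grow
```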

All three methods encode the necessary equivalence checks among transfer functions as SMT formulas which are then discharged using the off-the-shelf SMT solver Z3 [31]. Our encodings use the theory of integers with quantifiers. The NAIVE method additionally uses free functions to encode sets of integers.

Experiments. We ran krill on our benchmark suite and compared the runtimes of the three different methods for computing the transfer failure. Our results are summarized in Fig. 5(left). For every search structure that we extracted benchmarks from, the figure lists: (i) the number #FG of flow graph pairs extracted, (ii) each method's total runtime for computing the footprints of all flow graph pairs, and (iii) the speedup of NEW over NAIVE in percent. The experiments were conducted on an Apple M1 Pro.

Figure 5(left) shows that the runtime for all methods is roughly linear in the number of computed footprints. Moreover, the absolute time for computing footprints is small, making the approaches practical. The figure also shows that our NEW and DIST methods have a performance advantage over the NAIVE method. The NEW method is between 22% and 39% faster than the NAIVE method. We believe that the difference is relatively small only because the acyclicity assumption avoids a potentially non-terminating fixed point computation. Avoiding this fixed point in the presence of cycles is a major advantage that our NEW method has over the NAIVE and DIST methods. The performance difference between DIST and NEW is negligible because the acyclicity check itself is cheap.


Figure 5. Experimental results averaged over 1000 repeated runs, conducted on an Apple M1 Pro. (left) Total runtime for computing footprints for flow graphs occurring during automated proof construction for highly concurrent set data structures. The speedup gives the relative performance improvement of NEW over NAIVE. (right) Average runtime for computing a single footprint, partitioned by footprint size ( indicates failure).

We also broke down the runtimes of our benchmarks by the size of the resulting footprint. Figure 5(right) gives the average runtime and standard deviation for computing a single footprint, partitioned by footprint size. If no footprint could be found, its size is listed as . These failed footprint constructions are consistent with plankton's method and would not lead to verification failure.

## 6 Related Work

Two alternative meta theories for the flow framework have been proposed in prior work [23, 24]. Like in our setup, the original flow framework [23] demands that the flow domain is an ω-cpo to obtain a least fixed point semantics. However, it proposes a different flow graph composition that leads to a notion of contextual equivalence relying on inflow equivalence classes. This complicates proof automation. In addition, the flow domain is assumed to be a semiring and edge functions are restricted to multiplication with a constant. This limits expressivity.

As discussed in Sect. 1, the revised flow framework proposed in [24] requires that the flow monoid is cancellative but not an ω-cpo. This means that uniqueness of flows is not guaranteed per se. Instead, uniqueness is obtained by imposing additional conditions on the edge functions. However, these conditions are more restrictive than those imposed in our framework. The *capacity* of a flow graph introduced in [24] closely relates to our notion of transfer function. A closed-form representation based on sums over paths is used to check equivalence of capacities. However, this reasoning is restricted to acyclic graphs. Also, [24] provides no algorithm for computing flow footprints.

In a sense, our work strikes a balance between the two prior meta theories by guaranteeing unique flows without sacrificing expressivity and, at the same time, enabling better proof automation. That said, we believe that the framework proposed in [24] remains of independent interest, in particular if the application does not require unique flows (i.e., does not impose lower bounds on flows that may trivially hold in the presence of vanishing flows). Cancellativity allows one to aggregate inflows and outflows to unary functions, which can lead to smaller flow footprints (i.e., more local proofs).

The benchmark suite for our evaluation is obtained from plankton [26, 27], a tool for verifying concurrent search structures using keyset flows. When the program mutates the symbolic heap, plankton creates a flow graph for the mutated nodes plus all nodes within a distance of k from those nodes. This flow graph is considered to be the footprint and contextual equivalence is checked. The check is basically the same as for NAIVE. However, the paper presents neither the meta theory for the underlying notion of flow graphs nor any justification for the correctness of the implemented algorithms used to reason about flow graphs.

Flow graphs form a separation algebra. Hence, the developed theory can be used in combination with any existing separation logic that is parametric in the underlying separation algebra, such as [5, 7, 18, 27, 41, 44]. Identifying footprints of updates relates to the frame inference problem in separation logic, which has been studied extensively [4, 6, 15, 25, 35, 36, 42]. However, existing work focuses on frame inference for assertions that are expressed in terms of inductive predicates. These techniques are not well suited for reasoning about programs manipulating general graphs, including overlaid structures, which are often used in practice and easily expressed using flows. A common approach to reasoning about general heap graphs in separation logic is to use iterated separating conjunction [14, 39, 44, 47] to abstract the heap by a *pure* graph that does not depend on the program state. However, the verification of specifications that rely on inductive properties of the pure graph then falls back to classical first-order reasoning and is difficult to automate. An exception is [45], which uses SMT solvers to frame binary reachability relations in graphs that are described by iterated separating conjunctions. However, the technique is restricted to such reachability properties only.

Unbounded footprints were encountered early on when computing the post image for recursive predicates [8]. This has spawned interest in separation logic fragments for which the reasoning can be efficiently automated [2, 3, 9, 17, 20, 35, 38]. A limitation that underlies all these works is an assumption of tree-regularity of the heap, in one way or another, which flows have been designed to overcome. In cases where the program (or ghost code) traverses the unbounded footprint (before or after the update), recent works [24, 27] have found a way to reduce the reasoning to bounded footprint chunks.

The definition of a flow closely resembles the classical formulation of a forward data flow analysis. The fact that the least fixed point of the flow equation for distributive edge functions can be characterized as a join over all paths in the flow graph mirrors dual results for greatest fixed points in data flow analysis [19, 21]. In a similar vein, the notion of contextual equivalence of flow graphs relates to contextual program equivalence and fully abstract models in denotational semantics [16, 30, 37]. In fact, Bekić's Lemma [1], which we use in the proofs of Theorem 1 and Lemma 6, was originally motivated by the study of such models. Flow graphs can serve as abstractions of programs (rather than just program states). We therefore believe that our results could also be of interest for developing incremental and compositional data flow analysis frameworks.

## Data Availability Statement

The krill artifact and dataset generated and/or analysed in the present paper are available in the Zenodo repository [28], https://zenodo.org/record/7566204.

## Acknowledgments

This work is funded in part by NSF grant 1815633. The first author was supported by the DFG project *EDS@SYN: Effective Denotational Semantics for Synthesis*. The third author is supported by a Junior Fellowship from the Simons Foundation (855328, SW).

## References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **ALASCA: Reasoning in Quantified Linear Arithmetic**

Konstantin Korovin<sup>3</sup>, Laura Kovács<sup>1</sup>, Giles Reger<sup>3</sup>, Johannes Schoisswohl<sup>1</sup>(✉), and Andrei Voronkov<sup>2,3</sup>

> <sup>1</sup> TU Wien, Vienna, Austria
> johannes.schoisswohl@tuwien.ac.at
> <sup>2</sup> EasyChair, Manchester, UK
> <sup>3</sup> University of Manchester, Manchester, UK

**Abstract.** Automated reasoning is routinely used in the rigorous construction and analysis of complex systems. Among different theories, arithmetic stands out as one of the most frequently used and at the same time one of the most challenging in the presence of quantifiers and uninterpreted function symbols. First-order theorem provers perform very well on quantified problems thanks to the efficient superposition calculus, but their support for arithmetic reasoning is limited to heuristic axioms. In this paper, we introduce the Alasca calculus that lifts superposition reasoning to the linear arithmetic domain. We show that Alasca is both sound and complete with respect to an axiomatisation of linear arithmetic. We implemented and evaluated Alasca using the Vampire theorem prover, solving many more challenging problems compared to state-of-the-art reasoners.

**Keywords:** Automated Reasoning · Linear Arithmetic · SMT · Quantified First-Order Logic · Theorem Proving

## **1 Introduction**

Automated reasoning is undergoing rapid development thanks to its successful use, for example, in mathematical theory formalisation [15], formal verification [16] and web security [13]. The use of automated reasoning in these areas is mostly driven by the application of SMT solving for quantifier-free formulas [6, 12, 29]. However, there exist many use case scenarios, such as expressing arithmetic operations over memory allocation and financial transactions [1, 18, 20, 32], which require complex first-order quantification. SMT solvers handle quantifiers using heuristic instantiation in domain-specific model construction [10, 28, 30, 36]. Besides being incomplete in most cases, instantiation requires instances to be produced to perform reasoning, which can lead to an explosion in work required for quantifier-heavy problems. What is needed to address the above use cases is a reasoning approach able to handle both theories and complex applications of quantifiers. Our work tackles this challenge and designs a *practical, low-cost methodology* for proving first-order quantified linear arithmetic properties.

The problem of combining quantifiers with theories, and especially with arithmetic, is recognised as a major challenge in both the SMT and first-order proving communities. In this paper *we focus on first-order, i.e. quantified, reasoning with linear arithmetic and uninterpreted functions*. In [26], it is shown that the validity problem for first-order reasoning with linear arithmetic and uninterpreted functions is *Π*<sup>1</sup><sub>1</sub>-complete even when quantifiers are restricted to non-theory sorts. Therefore, there is no sound and complete calculus for this logic.

**Quantified Reasoning in Linear Arithmetic – Related Works.** In practice, there are two classes of methods for first-order theory reasoning, and in particular for reasoning with linear real arithmetic. SMT solvers use *instance-based methods*: they repeatedly generate ground, that is quantifier-free, instances of quantified formulas and use decision procedures to check satisfiability of the resulting set of ground formulas [10, 28, 36]. Superposition-based first-order theorem provers use *saturation algorithms* [14, 27, 37]. In essence, they start with an initial set of clauses obtained by preprocessing the input formulas (the initial search space) and repeatedly apply inference rules (such as superposition) to clauses in the search space, adding their (generally, non-ground) consequences to the search space. These two classes of methods are very different in nature and complement each other.

The superposition calculus [4, 31] is a refutationally complete calculus for first-order logic with equality that is used by modern first-order provers, for example, Vampire [27], E [37], iProver [17] and Zipperposition [14]. There have been a number of practical extensions of this calculus for reasoning in first-order theories, in particular for linear arithmetic [9, 11, 24]. Superposition theorem provers have become efficient and powerful at theory reasoning after the introduction of the AVATAR architecture [33, 38], which allows generated ground clauses to be passed to SMT solvers. Yet, superposition theorem provers have a major source of inefficiency. To work with theories, one has to add *theory axioms*, for example the transitivity of inequality ∀*x*∀*y*∀*z*(*x* ≤ *y* ∧ *y* ≤ *z* → *x* ≤ *z*). In clausal form, this formula becomes ¬*x* ≤ *y* ∨ ¬*y* ≤ *z* ∨ *x* ≤ *z*, where ¬*x* ≤ *y* can be resolved against *every* clause in which an inequality literal *s* ≤ *t* is selected. This, together with other prolific theory axioms, results in a very significant growth of the search space. Note that SMT solvers do not use and do not need such theory axioms.

A natural solution is to try to eliminate some theory axioms, but this is notoriously difficult both in theory and in practice. In [26], the Lasca calculus was proposed, which replaced several theory axioms of linear arithmetic, including transitivity of inequality, by a new inference rule inspired by Fourier-Motzkin elimination, together with some additional rules. Lasca was shown to be complete for the ground case. But, after 15 years, Lasca is still not implemented, due to its complexity and the lack of a clear treatment of the non-ground case. As we argue in Sect. 5, lifting Lasca to the non-ground setting is nearly impossible, as a non-ground extension of the underlying ordering is missing in [26].

**Lifting Lasca to Alasca – Our contributions***.* In this paper we introduce a new non-ground version of Lasca, which we call Abstracting Lasca (Alasca). Our Alasca calculus comes with new abstraction mechanisms (Sect. 4), inference

rules and orderings (Sect. 5), which all together are proved to yield a sound and complete approach with respect to a natural partial axiomatisation of linear arithmetic (Theorem 5).<sup>4</sup> In a nutshell, we make Alasca both work and scale by introducing (i) a novel variable elimination rule within saturation-based proof search (Fig. 3b); (ii) an analogue of *unification with abstraction* [34] needed for non-ground reasoning (Sect. 4); and (iii) a new non-ground ordering and powerful background theory for unification, which is not restricted to arithmetic but can be used with arbitrary theories (Sect. 5). As a result, Alasca improves over [26] by ground modifications and a finitary lifting of Lasca, and complements [3, 40] with variable elimination rules that are compatible with standard saturation algorithms. We also *demonstrate the practicality and efficiency* of Alasca (Sect. 6). To this end, we implemented Alasca in Vampire and show that it solves overall more problems than existing theorem provers.

## **2 Motivating Example**

Consider the following mathematical property:

$$\forall x, y. \left( f(2x, y) > 2x + y \lor f(x+1, y) > x + 2y \right) \to \forall x. \exists y. f(2, y) > x \tag{1}$$

where *f* is an uninterpreted function. While property (1) holds, deriving its validity is hard for state-of-the-art reasoners: only veriT [2] can solve it. Despite its seeming simplicity, this problem requires non-trivial handling of quantifers and arithmetic. Namely, one would need to unify (modulo theory) the terms 2*x* and *x* + 1 (which can be done by instantiating *x* with 1) and then derive *f*(2*, y*) *>* 2 + *y* ∨ *f*(2*, y*) *>* 1 + 2*y*. Further, one also needs to prove that *f*(2*, y*) is always greater than the minimum of 2 + *y* and 1 + 2*y*, for arbitrary *y*.
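The last step of this argument relies on the simple fact that a value exceeds the minimum of two bounds exactly when it exceeds at least one of them. This can be sanity-checked exhaustively over a small integer grid (our own illustration, not part of any prover):

```python
def exceeds_min(a, y):
    # a > min(2 + y, 1 + 2y) holds iff a exceeds at least one of the bounds
    return a > min(2 + y, 1 + 2 * y)

# exhaustive check of the equivalence on a small grid of integer values
for a in range(-10, 11):
    for y in range(-10, 11):
        assert exceeds_min(a, y) == (a > 2 + y or a > 1 + 2 * y)
```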

Vampire with Alasca finds a remarkably short proof, as shown in Fig. 1. To prove (1), its negation is shown unsatisfiable by first negating and translating into clausal form (using skolemization and normalisation, which shifts arithmetic terms to be compared to 0), as listed in lines 1–4. Next, a lower bound for *f*(2*x, y*) is established: in line 5, using our new inequality factoring (IF) rule with unification with abstraction (see Fig. 3a), the constraint 2*x* ̸≈ *x* + 1 is introduced, thereby establishing that if 2*x* ≈ 1 + *x* and *y* + 2*x* ≤ 2*y* + *x*, then *f*(2*x, y*) *>* 2*x* + *y*. After further normalisation, the inequalities *sk* ≥ *f*(2*, y*) and *f*(2*x, y*) *>* 2*x* + *y* are used to derive *sk >* 2*x* + *y* in line 7, using the Fourier-Motzkin elimination rule (FM), while still keeping track of the constraint 2*x* ̸≈ *x* + 1. By applying the variable elimination rule (VE) twice, the empty clause □ is derived in line 10, showing the unsatisfiability of the negation of (1).

The key steps in the proof (and the reason why it was found in a short time) are: (1) the use of the theory rules (FM) and (IF); (2) the use of the new variable elimination rule (VE); and finally, a consistent use of unification with abstraction. These rules give a significant reduction compared to the number of steps required using theory axioms. In particular, not using (FM) would require the use of transitivity and the generation of several intermediate clauses. As well as shortening

<sup>4</sup> Proofs and further details of our results can be found in [23].

1. *f*(2*x, y*) *>* 2*x* + *y* ∨ *f*(*x* + 1*, y*) *> x* + 2*y*   (Hypothesis)
2. ¬*f*(2*, y*) *> sk*   (Skolemized, Neg. Conj.)
3. *f*(2*x, y*) − 2*x* − *y >* 0 ∨ *f*(*x* + 1*, y*) − *x* − 2*y >* 0   (Normalisation 1)
4. −*f*(2*, y*) + *sk* ≥ 0   (Normalisation 2)
5. *f*(2*x, y*) − 2*x* − *y >* 0 ∨ *y* + 2*x* − 2*y* − *x >* 0 ∨ 2*x* ̸≈ *x* + 1   (IF 3)
6. *f*(2*x, y*) − 2*x* − *y >* 0 ∨ *x* − *y >* 0 ∨ 0 ̸≈ *x* − 1   (Normalisation 5)
7. −2*x* − *y* + *sk >* 0 ∨ *x* − *y >* 0 ∨ 0 ̸≈ *x* − 1 ∨ 2*x* ̸≈ 2   (FM 6,4)
8. −2*x* − *y* + *sk >* 0 ∨ *x* − *y >* 0 ∨ 0 ̸≈ *x* − 1   (Normalisation 7)
9. 0 ̸≈ *x* − 1   (VE 8)
10. □   (VE 9)

**Fig. 1.** A refutational proof using the calculus introduced in this paper. Variables *x, y* are implicitly universally quantified, and *sk* is an uninterpreted constant.

the proof, we eliminate the fatal impact on proof search of generating a large number of irrelevant formulas from theory axioms.

Indeed, such short proofs are also found quickly. Similar to our previous example, ∀*x, y. f*(*g*(*x*) + *g*(*a*)*, y*) *>* 2*x* + *y* ∨ *f*(2*g*(*x*)*, y*) *> x* + 2*y* → ∃*k.* ∀*x.* ∃*z. f*(2*g*(*k*)*, z*) *> x* has a short proof of 7 steps, excluding CNF transformation and normalisation steps, found by Vampire with Alasca. This proof was found in almost no time (only 37 clauses were generated), yet the problem cannot be solved by any other solver. This shows the power of the calculus.

## **3 Background and Notation**

*Multi-Sorted First-Order Logic.* We assume familiarity with standard first-order logic with equality, with all standard boolean connectives and quantifiers in the language. We consider a multi-sorted first-order language, with sorts *τ*<sub>Q</sub>*, τ*<sub>1</sub>*, . . . , τ<sub>n</sub>*. The sort *τ*<sub>Q</sub> is the *sort of rationals*, whereas *τ*<sub>1</sub>*, . . . , τ<sub>n</sub>* are *uninterpreted sorts*. We write ≈*<sub>τ</sub>* for the equality predicate of *τ*. We denote the set of all terms as **T**, variables as **V**, and literals as **L**. Throughout this paper, we denote terms by *s, t, u*, variables by *x, y, z*, and function symbols by *f, g, h*, all possibly with indices. Given a term *t* of the form *f*(*. . .*), we write sym(*t*) for *f*, the top level symbol of *t*. We write *t* : *τ* to denote that *t* is a term of sort *τ*. A term or literal is called *ground* when it does not contain any variables. We refer to the sets of all ground terms and literals as **T**<sup>θ</sup> and **L**<sup>θ</sup>, respectively.

We denote predicates by *P, Q*, literals by *L*, clauses by *C, D*, formulas by *F, G*, and sets of formulas (axioms) by E, possibly with indices. We write *F* |= *G* to denote that whenever *F* holds in a model, then *G* does as well. We call a function (similarly, for predicates) *f* uninterpreted wrt some set of equations E if whenever E |= *f*(*s*<sup>1</sup> *. . . sn*) ≈ *f*(*t*<sup>1</sup> *. . . tn*), then E |= *s*<sup>1</sup> ≈ *t*<sup>1</sup> ∧ *. . .* ∧ *s<sup>n</sup>* ≈ *tn*. A function *f* is interpreted wrt E if it is not uninterpreted.

*Rational Sort.* We assume the signature contains a countable set of unary functions *k* : *τ*<sub>Q</sub> → *τ*<sub>Q</sub> for every *k* ∈ Q and refer to these *k* as *numeral multiplications*. In addition, the signature is assumed to also contain a constant 1 : *τ*<sub>Q</sub>, a function

+ : *τ*<sub>Q</sub> × *τ*<sub>Q</sub> → *τ*<sub>Q</sub>, and predicate symbols *>,* ≥ : **P**(*τ*<sub>Q</sub> × *τ*<sub>Q</sub>), as well as an arbitrary number of other function symbols. For every numeral multiplication *k* ∈ Q \ {1}, we simply write *k* to denote the term *k*(1) obtained by applying the numeral multiplication *k* to 1; in these cases, we refer to *k* as a *numeral*. Throughout this paper, we use *j, k, l* to denote numerals or numeral multiplications, possibly with indices.

We write −*t* to denote the term −1(*t*). If *j, k* are two numeral multiplications, by (*jk*) and (*j* + *k*) we denote the numeral multiplications that correspond to the result of multiplying and adding the rationals/numerals *j* and *k*, respectively. For applications of numeral multiplications *j*(*t*) we may omit the parentheses and write *jt* instead. If we write +*k* or −*k* for some numeral *k*, we assume *k* itself is positive. We write ± (and ∓) to denote either of the symbols + or − (and respectively − or +). For *q* ∈ Q we define **sign**(*q*) to be 1 if *q >* 0, −1 if *q <* 0, and 0 otherwise. We call +, ≥, *>*, 1, and the numeral multiplications the Q *symbols*. Finally, an *atomic term* is either a logical variable, the term 1, or a term whose top level function symbol is not a Q symbol.

A Q*-model* interprets the sort *τ*<sub>Q</sub> as Q, and all Q symbols as their corresponding functions/predicates on Q. We write Q |= *C* if *M* |= *C* holds for every Q-model *M*. If E is a set of formulas, we call a model *M* an E*-model* if *M* |= E.

*Term Orderings.* We write *u*[*s*] to denote that *s* is a subterm of *u*, where the subterm relation is denoted via ⊴. That is, *s* ⊴ *u*; similar notation will also be used for literals *L*[*s*] and clauses *C*[*s*]. We denote by *u*[*s* 7→ *t*] the term resulting from replacing all subterms *s* of *u* by *t*.

Multisets (of terms or literals) are denoted with dotted braces ˙{ *. . .* ˙}. For a multiset *S* and natural number *n* ∈ N, we define 0 ∗ *S* = ∅ and *n* ∗ *S* = ((*n* − 1) ∗ *S*) ∪ *S* for *n >* 0.

Let ≺ be a relation and ≡ an equivalence relation. By ≺<sup>mul</sup><sub>≡</sub> we denote the *multiset extension* of ≺, defined as the smallest relation satisfying *M* ∪ ˙{*s*<sub>1</sub>*, . . . , s<sub>n</sub>*˙} ≺<sup>mul</sup><sub>≡</sub> *N* ∪ ˙{*t*˙}, where *M* ≡ *N*, *n* ≥ 0, and *s<sub>i</sub>* ≺ *t* for 1 ≤ *i* ≤ *n*. For *n, m* ∈ N, by ≺<sup>wmul</sup><sub>≡</sub> we denote the *weighted multiset extension*, defined by ⟨1/*n*, *S*⟩ ≺<sup>wmul</sup><sub>≡</sub> ⟨1/*m*, *T*⟩ if *m* ∗ *S* ≺<sup>mul</sup><sub>≡</sub> *n* ∗ *T*. We omit the equivalence relation ≡ when it is clear from the context.
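These definitions can be executed directly. The following sketch (our own illustration, using syntactic equality in the role of ≡ and a user-supplied base order for ≺) implements *n* ∗ *S*, the one-step multiset extension, and the weighted variant:

```python
from collections import Counter

def scale(n, S):
    """n ∗ S: multiset union of n copies of S (0 ∗ S is the empty multiset)."""
    total = Counter()
    for _ in range(n):
        total.update(S)
    return total

def mul_less(A, B, less):
    """One-step multiset extension: A ≺ B iff A = N ∪ {s1, ..., sn} and
    B = N ∪ {t} with every si ≺ t (syntactic equality plays the role of ≡)."""
    cA, cB = Counter(A), Counter(B)
    for t in list(cB):
        N = cB.copy()
        N[t] -= 1  # N = B with one occurrence of t removed
        if any(N[x] > cA[x] for x in N):
            continue  # N is not contained in A
        rest = cA.copy()
        rest.subtract(N)  # rest = {s1, ..., sn}, all of which must be ≺ t
        if all(less(s, t) for s in rest.elements()):
            return True
    return False

def wmul_less(n, S, m, T, less):
    """Weighted extension: ⟨1/n, S⟩ ≺ ⟨1/m, T⟩ iff m ∗ S ≺mul n ∗ T."""
    return mul_less(scale(m, S), scale(n, T), less)
```

For example, {1, 1, 2} ≺ {3} under the usual order on integers, since all elements of the left multiset are smaller than the single element 3 on the right.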

Let *s, t, t<sub>i</sub>* be terms, *θ, θ*′ be ground substitutions, and E be a set of axioms. We write *s* ≡<sub>E</sub> *t* for E |= *s* ≈ *t*, and *θ* ≡<sub>E</sub> *θ*′ if *xθ* ≡<sub>E</sub> *xθ*′ for all variables *x*. We say that *s* is an E-subterm of *t* (*s* ⊴<sub>E</sub> *t*) if *s* ≡<sub>E</sub> *t*, or *t* ≡<sub>E</sub> *f*(*t*<sub>1</sub> *. . . t<sub>n</sub>*) and *s* ⊴<sub>E</sub> *t<sub>i</sub>* for some *i*. We also say that *s* is a strict E-subterm of *t* (*s ◁*<sub>E</sub> *t*) if *s* ⊴<sub>E</sub> *t* and *s* ̸≡<sub>E</sub> *t*.

## **4 Theoretical Foundation for Unification with Abstraction**

Our motivating example from Sect. 2 showcases that first-order arithmetic reasoning requires (i) establishing syntactic difference among terms (e.g. 2*x* and *x* + 1), while (ii) deriving that they have instances that are semantically equal in models of a background theory E (e.g. the theory Q).

A naive approach addressing (i)-(ii) would be to use an axiomatisation of the background theory E for proof search in uninterpreted first-order logic. Such an approach can however be very costly. For example, even a relatively simple background theory **AC** axiomatizing commutativity and

```
fn uwa(s, t)
  eqs ← {s ≈ t}; σ ← ∅; C ← ∅;
  while eqs ≠ ∅
    ṡ ≈ ṫ ← eqs.pop();
    if ṡ ≈ ṫ ∈ {x ≈ u, u ≈ x} for some x ∈ V, x ̸◁ u
      ⟨σ, eqs, C⟩ ← ⟨σ ∪ {x ↦ u}, eqs, C⟩{x ↦ u};
    else if canAbstract(ṡ, ṫ)
      C.push(ṡ ̸≈ ṫ);
    else if ṡ = f(s₁ … sₙ), ṫ = f(t₁ … tₙ)
      eqs.push({s₁ ≈ t₁, …, sₙ ≈ tₙ})
    else
      return ⊥;
  return ⟨σ, C⟩;
```
**Algorithm 1:** Computing an abstracting unifier uwa.
associativity of +, that is **AC** = {*x* + *y* ≈ *y* + *x*, *x* + (*y* + *z*) ≈ (*x* + *y*) + *z*}, would make a superposition-based theorem prover derive a vast amount of useless/redundant formulas as equational tautologies. An approach to circumvent such inefficient handling of equality reasoning is to use *unification modulo* **AC**, or in general *unification modulo* E, as already advocated in [22, 34, 40]. In this section we describe the adjustments we made towards unification modulo E, allowing us to introduce *unification with abstraction* (Sect. 4.1). We also show under which conditions our method can be used to turn a complete superposition calculus using unification modulo E into a complete superposition calculus using unification with abstraction. Concretely, we show how this can be used for the specific theory of arithmetic A<sub>eq</sub> in the calculus Alasca (Sect. 4.2).

## **4.1 Unification with Abstraction – UWA**

In a nutshell, unification modulo E finds substitutions *σ* that make two terms *s, t* equal in the background theory, i.e. E |= *sσ* ≈ *tσ*. While unification modulo E removes the need for an axiomatisation of E during superposition reasoning, it comes with some inefficiencies. Most importantly, in contrast to syntactic unification, there is no unique most general unifier mgu(*s, t*) when unifying modulo E but only minimal complete sets of unifiers mcu<sub>E</sub>(*s, t*), which can be very large; for example, unification modulo **AC** is doubly exponential in general [22].

Bypassing the need for unification modulo E, *fully abstracted clauses* are used in [40], without the need for an axiomatisation of the theory E and without compromising completeness of the underlying superposition-based calculus. Our work extends ideas from [40] and adjusts *unification with abstraction* (uwa) from [34], allowing us to prove completeness of a calculus using uwa (Theorem 3).

*Example 1.* Let us first consider the example of factoring the clause *p*(2*x*) ∨ *p*(*x* + 1), a simplified version of the unification step performed in line 5 in Fig. 1. That is, unifying the literals *p*(2*x*) and *p*(*x* + 1) in order to remove duplicate literals. Within the setting of [40], these literals would only exist in their fully abstracted

form, which can be obtained by replacing every subterm *t* : *τ*<sub>Q</sub> that is not a variable by a fresh variable *x*, and adding the constraint *x* ̸≈ *t* to the corresponding clause. Hence, the clause *p*(2*x*) ∨ *p*(*x* + 1) is transformed to *p*(*y*) ∨ *p*(*z*) ∨ *y* ̸≈ 2*x* ∨ *z* ̸≈ *x* + 1 in [40]. Unification then becomes trivial: we would derive the clause *p*(*y*) ∨ *y* ̸≈ 2*x* ∨ *y* ̸≈ *x* + 1 by factoring, from which *p*(2*x*) ∨ 2*x* ̸≈ *x* + 1 is inferred using equality factoring and resolution.
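For concreteness, the full abstraction step described in this example can be sketched in Python. This is our own toy encoding (literals as tuples, variables as strings, and `is_theory_term` as a placeholder predicate for subterms of sort *τ*<sub>Q</sub>), illustrating the transformation of [40], not its actual implementation:

```python
def fully_abstract(clause, is_theory_term):
    """Fully abstract a clause (illustrative sketch of the step from [40]).

    Each non-variable argument recognised by is_theory_term is replaced by a
    fresh variable, and the pair (fresh, term), read as the disequality
    fresh ≉ term, is recorded as an additional constraint literal.
    """
    counter = 0
    constraints, abstracted = [], []
    for head, *args in clause:
        new_args = []
        for a in args:
            if isinstance(a, tuple) and is_theory_term(a):
                counter += 1
                v = f"_v{counter}"          # fresh variable
                constraints.append((v, a))  # v ≉ a joins the clause
                new_args.append(v)
            else:
                new_args.append(a)
        abstracted.append((head, *new_args))
    return abstracted, constraints
```

Running this on the encoding of *p*(2*x*) ∨ *p*(*x* + 1) yields the abstracted literals *p*(\_v1) ∨ *p*(\_v2) together with the constraints \_v1 ≉ 2*x* and \_v2 ≉ *x* + 1, matching the transformed clause above (up to variable naming).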

Within unification with abstraction, we aim at cutting out the intermediate steps of applying abstraction, equality resolution and factoring. As a result, we skip unnecessary consequences of intermediate clauses and derive the conclusion *p*(2*x*) ∨ 2*x* ̸≈ *x* + 1 straight away. To this end, we introduce constraints only for those *s, t* : *τ*<sub>Q</sub> on which unification fails. We thus gain the advantage that clauses are not present in the search space in their abstracted forms, increasing efficiency in proof search. Further, our unification with abstraction approach is parametrized by a predicate canAbstract to control the application of abstraction, as listed in Algorithm 1. This is yet another significant difference compared to fully abstracted clauses, as in the latter, abstraction is performed for every subterm *t* : *τ*<sub>Q</sub> without considering the terms with which *t* might be unified later.

Our uwa method can be seen as a lazy variant of the full abstraction from [40]. We compute so-called abstracting unifiers uwa(*s, t*) = ⟨*σ,* C⟩ with Algorithm 1, allowing us to replace unification modulo E by unification with abstraction.
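To illustrate the control flow of Algorithm 1, here is a runnable Python rendition. It is our own simplification (variables as strings, compound terms as tuples, constraints returned as pairs read as disequalities) and omits the sorts and orderings of the real calculus:

```python
def uwa(s, t, can_abstract):
    """Unification with abstraction: a simplified sketch of Algorithm 1.

    Returns (sigma, constraints) on success, None for ⊥. Each constraint
    (a, b) is read as the disequality a ≉ b added to the conclusion clause.
    """
    def subst(term, x, u):
        if term == x:
            return u
        if isinstance(term, tuple):
            return (term[0],) + tuple(subst(a, x, u) for a in term[1:])
        return term

    def occurs(x, term):  # x ◁ term: strict occurrence check
        return isinstance(term, tuple) and any(
            a == x or occurs(x, a) for a in term[1:])

    eqs, sigma, constraints = [(s, t)], {}, []
    while eqs:
        a, b = eqs.pop()
        if a == b:
            continue
        if isinstance(a, str) or isinstance(b, str):
            x, u = (a, b) if isinstance(a, str) else (b, a)
            if not occurs(x, u):
                # bind x and apply {x -> u} to sigma, worklist, constraints
                sigma = {v: subst(w, x, u) for v, w in sigma.items()}
                sigma[x] = u
                eqs = [(subst(l, x, u), subst(r, x, u)) for l, r in eqs]
                constraints = [(subst(l, x, u), subst(r, x, u))
                               for l, r in constraints]
                continue
        if can_abstract(a, b):
            constraints.append((a, b))  # emit constraint instead of failing
        elif (isinstance(a, tuple) and isinstance(b, tuple)
              and a[0] == b[0] and len(a) == len(b)):
            eqs.extend(zip(a[1:], b[1:]))  # decompose f(...) against f(...)
        else:
            return None  # ⊥: not unifiable
    return sigma, constraints
```

On the factoring example from above, unifying *p*(2*x*) against *p*(*x* + 1) with a canAbstract predicate that triggers on arithmetic-topped terms returns the empty substitution together with the single constraint 2*x* ≉ *x* + 1.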

**Definition 1 (Abstracting Unifier).** *Let σ be a substitution and* C *a set of literals. A partial function* uwa *that maps two terms s, t either to* ⊥ *or to a pair* ⟨*σ,* C⟩ = uwa(*s, t*) *is called an* abstracting unifier*.*

The abstracting unifier uwa(*s, t*) computed by Algorithm 1 is parametrized by the relation canAbstract. The intuition behind this relation is that canAbstract(*s, t*) holds for terms *s* and *t* when *s* ≈ *t* might hold in the background theory E. To ensure that unification with abstraction can replace unification modulo E, we impose the following additional properties on the abstracting unifier uwa(*s, t*).

**Definition 2 (**uwa **Properties).** *Let σ be a substitution and* C *a set of literals. Let s, t* ∈ **T** *be such that* uwa(*s, t*) = ⟨*σ,* C⟩ *and let θ be an arbitrary ground substitution. We say* uwa *is*


*Further,* uwa *is* E-complete *if, for all s, t* ∈ **T** *with* uwa(*s, t*) = ⊥*, we have* mcu<sub>E</sub>(*s, t*) = ∅*.*

Definition 2 is necessary to lift inferences using unification with abstraction. We thereby want to ensure that, whenever *C* does not hold, then *s* and *t* are

equal; hence abstracting unifers uwa(*x, y*) = ⟨∅*, x* + *y* ̸≈ *y* + *x*⟩ would be unsound. The E-generality property enforces that substitutions introduced by uwa are general enough in order to still be turned into a complete set of unifers. As such, E-generality is needed to rule out cases like uwa(*x* + *y,* 2) = ⟨{*x* 7→ 0*, y* 7→ 2}*,* ∅⟩, which would not be able to capture, for example, the substitution {*x* 7→ 1*, y* 7→ 1}. We note that we use uwa to extend counterexample-reducing inference systems (see Defnition 4), allowing inductive completeness proofs. As these inference systems need to derive conclusions that are smaller than the premises, we need the subterm-foundedness property to make sure to only introduce constraints that are smaller than the premises as well. If we have a look at the previous properties, we see that all of them are fulflled if uwa(*s, t*) = ⊥. Therefore we need to make sure that uwa only returns ⊥ when *s* and *t* are not unifable modulo E; this is captured by E-completeness.

In addition to the properties of abstracting unifiers uwa(*s, t*), we also impose conditions on the canAbstract relation that parametrizes uwa(*s, t*). As Algorithm 1 only introduces equality constraints for subterm pairs that should be unified, a resulting abstracting unifier uwa(*s, t*) is sound. Further, under the assumption that the clause ordering is defined as in standard superposition (e.g. using multiset extensions of a simplification ordering that fulfils the subterm property), the abstracting unifier uwa(*s, t*) is also subterm-founded. However, to ensure that uwa(*s, t*) is also minimal, interpreted functions should not be treated as uninterpreted ones; hence the canAbstract relation needs to always trigger abstraction on interpreted functions. Finally, we require that canAbstract does not skip terms which are potentially equal modulo E, in order to guarantee completeness. Hence, we define the following properties for canAbstract.

#### **Definition 3 (**canAbstract **Properties).** *Let s, t* ∈ **T***. The* canAbstract *relation*


Based on the above, we derive the following result.

**Theorem 1.** *The abstracting unifier* uwa *computed by Algorithm 1 is subterm-founded and sound. If* canAbstract *guards interpreted functions, then* uwa *is* E*-general and* E*-minimal. If* canAbstract *guards interpreted functions and captures* E*, then* uwa *is* E*-complete.*

#### **4.2 UWA Completeness**

We now show how unification with abstraction (uwa) can be used to replace unification modulo E in saturation-based theorem proving [3]. We recall from [3] that in order to show refutational completeness of an inference system *Γ*, one constructs a *model functor I* that maps sets of ground clauses *N* to candidate models *I*<sub>*N*</sub>. In order to show that *Γ* is refutationally complete, one needs to show that if *N* is saturated with respect to *Γ*, then *I*<sub>*N*</sub> ⊨ *N*. For this, the notion of a counterexample-reducing inference system is introduced.

**Definition 4.** *We say an inference system Γ is* counterexample reducing*, with respect to a model functor I and a well-founded ordering* ≺ *on ground clauses, if for every ground set of clauses N and every* ≺*-minimal C* ∈ *N such that I*<sub>*N*</sub> ̸⊨ *C, there is an inference*

$$\frac{C_1 \quad \dots \quad C_n \quad C}{D}$$

*where* ∀*i*. *I*<sub>*N*</sub> ⊨ *C*<sub>*i*</sub>, ∀*i*. *C*<sub>*i*</sub> ≺ *C*, *D* ≺ *C*, *and* *I*<sub>*N*</sub> ̸⊨ *D*.

We then have the following key result.

**Theorem 2 (Bachmair&Ganzinger [3]).** *Let* ≺ *be a well-founded ordering on ground clauses and I be a model functor. Then, every inference system that is counterexample-reducing wrt* ≺ *and I is refutationally complete.*

This result also holds for an inference system being refutationally complete wrt E if for every *N* it holds that *I*<sub>*N*</sub> ⊨ E. When constructing a refutationally complete calculus, one usually first defines a ground counterexample-reducing inference system and then lifts this calculus to a non-ground inference system. Lifting is done such that, if the ground inference system is counterexample reducing, then its lifted non-ground version is also counterexample reducing.

We next show how to transform a lifting of a counterexample-reducing inference system that uses unification modulo E into a lifting using unification with abstraction. That is, given a counterexample-reducing inference system using unification modulo E to define its rules, we construct another counterexample-reducing inference system that uses uwa instead. As we only transform rules that use unification, we introduce the notion of a unifying rule.

**Definition 5.** *An inference rule γ is a* unifying rule *if it is of the form*

$$\gamma = \left(\ \frac{C_1 \quad \dots \quad C_n \quad C}{D\sigma}\,,\ \text{where } \sigma \in \mathsf{mcu}_{\mathcal{E}}(s, t)\right).$$

We also define the mapping ◦uwa that maps unifying inferences *γ* to *γ*uwa as

$$\gamma_{\mathsf{uwa}} = \left(\ \frac{C_1 \quad \dots \quad C_n \quad C}{D\sigma \lor \mathcal{C}}\,,\ \text{where } \langle \sigma, \mathcal{C} \rangle = \mathsf{uwa}(s, t)\right).$$

Soundness of the unifying rule *γ* alone, however, does not suffice to show soundness of *γ*uwa. Therefore, we introduce a stronger notion of soundness that holds for all the rules we will consider lifting.

**Definition 6.** *Let γ be a unifying rule. We say γ is* strongly sound *if* E*, C*<sub>1</sub>*, . . . , C*<sub>*n*</sub>*, C* ⊨ *s* ≈ *t* → *D.*

**Lemma 1.** *Assume that γ is strongly sound and* uwa *is sound. Then, γ*uwa *is sound.*

We note that not every inference can be transformed using ◦uwa without compromising completeness. To circumvent this problem, we consider the notion of compatibility with respect to transformations.

**Definition 7.** *Let γ be a unifying inference. Then, γ* unifies strict subterms *if for every grounding θ and u* ∈ {*s, t*}*, there is an uninterpreted function or predicate f, a literal L*[*f*(*u*)]*, and a clause C* ′ ∈ {*C*<sub>1</sub> *. . . C*<sub>*n*</sub>*, C*}*, such that L*[*f*(*u*)]*θ* ⪯ *C* ′*θ.*

Note that in the above definition we usually have that *L*[*f*(*s*)] or *L*[*f*(*t*)] is some literal of one of the premises.

**Definition 8 (**uwa**-Compatibility).** *We say an inference γ is* uwa compatible *if it is a unifying inference, strongly sound, and unifies strict subterms.*

**Theorem 3.** *Let* uwa *be a sound,* E*-general, subterm-founded,* E*-complete, and* E*-minimal abstracting unifier. If Γ is the lifting of a counterexample-reducing inference system Γ*<sup>ϑ</sup> *with respect to a model functor I and clause ordering* ≺*, then Γ*uwa = {*γ*uwa | *γ* ∈ *Γ, γ is* uwa*-compatible*} ∪ {*γ* ∈ *Γ* | *γ is not* uwa*-compatible*} *is the lifting of an inference system Γ*<sup>ϑ</sup>uwa *that is counterexample-reducing with respect to I and* ≺*.*

Theorem 1 and Theorem 3 together imply that, given a compatible inference system, we only need to specify the right canAbstract predicate in order to perform a lifting using uwa. In Sect. 5 we introduce the calculus Alasca, a concrete inference system with the desired properties, for which a suitable predicate canAbstract can easily be found.

## **5 ALASCA Reasoning**

We use the lifting results of Sect. 4 to introduce our Alasca calculus for reasoning in quantified linear arithmetic, combining superposition reasoning with Fourier-Motzkin-style inference rules. While an instance of such a combination has been studied in the Lasca calculus of [26], Lasca is restricted to ground, i.e. quantifier-free, clauses. Our Alasca extends Lasca with uwa and provides an altered ground version Alasca*<sup>θ</sup>* (Sect. 5.1) which can efficiently be lifted to the quantified domain (Sect. 5.2). As quantified reasoning with linear real arithmetic and uninterpreted functions is inherently incomplete, we provide formal guarantees about what Alasca can prove. Instead of focusing on completeness with respect to Q-models as in [26], we show that Alasca is complete with respect to a partial axiomatisation A<sub>Q</sub> of Q-models (Sect. 5.2).

### **5.1 The ALASCA Calculus – Ground Version**

The Alasca calculus uses a partial axiomatisation A<sub>Q</sub> of Q-models, and handles some Q-axioms via inferences and some via uwa. We therefore split the axiom set A<sub>Q</sub> into A<sub>eq</sub> and A<sub>ineq</sub>, as listed in Fig. 2.

Our Alasca calculus modifies the Lasca framework [26] to enable an efficient lifting for quantified reasoning. For simplicity, we first present the ground version of Alasca, which we refer to as Alasca*<sup>θ</sup>*, one key benefit of which is illustrated next.

$$\begin{aligned}
\mathcal{A}_{\mathsf{Q}} &= \mathcal{A}_{\mathsf{eq}} \cup \mathcal{A}_{\mathsf{ineq}} &\qquad \mathcal{A}_{\mathsf{ineq}} = {}& \{x > y \wedge y > z \to x > z\} \\
\mathcal{A}_{\mathsf{eq}} &= \mathbf{AC} & \cup{} & \{x > y \to x + z > y + z\} \\
&\quad \cup \{jx + kx \approx (j+k)x \mid j, k \in \mathbb{Q}\} & \cup{} & \{x > y \vee x \approx y \vee y > x\} \\
&\quad \cup \{j(kx) \approx (jk)x \mid j, k \in \mathbb{Q}\} & \cup{} & \{x \ge y \leftrightarrow (x > y \vee x \approx y)\} \\
& & \cup{} & \{x > y \to kx > ky \mid k \in \mathbb{Q}, k > 0\}
\end{aligned}$$

**Fig. 2.** Axioms handled by the Alasca calculus. All axioms are implicitly universally quantified.

*Example 2.* One central rule of Alasca is the Fourier-Motzkin variable elimination rule (FM). We use (FM) in line 7 of Fig. 1, when proving the motivating example of Sect. 2, given in formula (1). Namely, using (FM), we derive −2*x* − *y* + *sk >* 0 from *f*(2*x, y*) − 2*x* − *y >* 0 and −*f*(2*, y*) + *sk* ≥ 0, under the assumption that 2*x* ≈ 2. The (FM) rule can be seen as a version of the inequality chaining rules of [3], chaining the inequalities *sk* ≥ *f*(2*, y*) and *f*(2*x, y*) *>* 2*x* + *y*. Moreover, the (FM) rule can also be considered a version of binary resolution, as it resolves the positive summand *f*(2*x, y*) with the negative summand −*f*(2*, y*), thus mimicking resolution over subterms instead of literals. The main benefit of (FM) comes from its restricted application to maximal atomic terms in a sum (instead of naive application whenever possible).
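The arithmetic behind such an (FM) step can be illustrated with a small sketch (our own toy encoding of linear terms as coefficient maps over atomic terms, not part of the calculus implementation): from *j·s* + *t*₁ > 0 and −*k·s* + *t*₂ ≥ 0 with *j, k* > 0, the occurrences of *s* cancel in *k·t*₁ + *j·t*₂ > 0.

```python
from fractions import Fraction as Q

# Linear terms over atomic terms, as {atom: coefficient} dicts.
# This only illustrates the arithmetic of the (FM) conclusion.

def scale(t, c):
    return {a: c * k for a, k in t.items()}

def add(t1, t2):
    out = dict(t1)
    for a, k in t2.items():
        out[a] = out.get(a, Q(0)) + k
        if out[a] == 0:
            del out[a]           # cancelled summands disappear
    return out

def fourier_motzkin(j, t1, k, t2):
    """From j*s + t1 > 0 and -k*s + t2 >= 0 (j, k > 0, s the maximal atom
    already removed into the premises), derive k*t1 + j*t2 > 0."""
    assert j > 0 and k > 0
    return add(scale(t1, k), scale(t2, j))
```

With *j* = *k* = 1, *t*₁ = −2*x* − *y* and *t*₂ = *sk*, this reproduces the conclusion −2*x* − *y* + *sk* > 0 from the example above.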

Alasca*<sup>θ</sup> Normalization and Orderings.* Compared to Lasca [26], the major difference of Alasca*<sup>θ</sup>* comes from focusing on which terms are considered equal within inferences; this in turn requires careful adjustments in the underlying orderings and normalization steps of Alasca*<sup>θ</sup>*, and later also in unification within Alasca. In Lasca, terms are rewritten into their so-called Q-normalized form, while equality inference rules exploit equivalence modulo **AC**. Lifting such inference rules is however tricky. Consider for example the application of the rewrite rule *j*(*ks*) → (*jk*)*s* (triggered by *j*(*ks*) ≈ (*jk*)*s*) over the clause *C*[*jx, x*]. In order to lift all instances of this rewrite rule, we would need to derive *C*[(*jk*)*x, kx*] for every *k* ∈ Q, which would yield an infinite number of conclusions. To resolve this matter, Alasca*<sup>θ</sup>* takes a different approach to term normalization and handling equivalence. That is, unlike Lasca, we formulate all inference rules using equivalence modulo A<sub>eq</sub>, and do not consider the normalization of terms as simplification rules.

As Alasca*<sup>θ</sup>* rules use equivalence modulo A<sub>eq</sub>, we also need to impose that the simplification ordering used by Alasca*<sup>θ</sup>* is A<sub>eq</sub>-compatible. Intuitively, A<sub>eq</sub>-compatibility means that terms that are equivalent modulo A<sub>eq</sub> are in one equivalence class wrt the ordering. This allows us to replace terms by an arbitrary normal form wrt these equivalence classes before and after applying any inference rule, enabling a normalization similar to Q-normalization that does not need to be lifted. Hence, we introduce A<sub>eq</sub>*-normalized terms* as terms whose sort is not *τ*<sub>Q</sub>, or terms of the form $\frac{1}{k}(k_1t_1 + \dots + k_nt_n)$ such that ∀*i*. *k*<sub>*i*</sub> ∈ ℤ \ {0}, ∀*i* ≠ *j*. *t*<sub>*i*</sub> ≢ *t*<sub>*j*</sub>, ∀*i*. *t*<sub>*i*</sub> is atomic, *k* is positive, and gcd({*k, k*<sub>1</sub>*, . . . , k*<sub>*n*</sub>}) = 1. Obviously, every term can be turned into an A<sub>eq</sub>-normalized term. For the rest of this section we assume terms are A<sub>eq</sub>-normalized, and write ≡ for ≡<sub>A<sub>eq</sub></sub>. We also assume that literals with interpreted predicates ⋄ are normalized (during preprocessing) to be of the form *t* ⋄ 0. We write *s* ≈̂ *t* for equalities with sorts different from *τ*<sub>Q</sub>, and for equalities of sort *τ*<sub>Q</sub> that can be rewritten to *s* ≈ *t* such that *s* is an atomic term. Finally, Alasca*<sup>θ</sup>* also extends Lasca by not only handling the predicates *>* and ≈, but also ≥ and ̸≈, which has the advantage that inequalities are not introduced in purely equational problems in Alasca*<sup>θ</sup>*.
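The normalization step described above can be sketched as follows (a hypothetical helper for illustration; representing atomic terms as dictionary keys is our own assumption). Coefficients per atom are collected, zero summands dropped, and a positive common factor $\frac{1}{k}$ is pulled out so that all *k*<sub>*i*</sub> are integers with gcd({*k*, *k*<sub>1</sub>, …, *k*<sub>*n*</sub>}) = 1:

```python
from fractions import Fraction
from math import gcd
from functools import reduce

# Sketch of A_eq-normalization for a sum of rational multiples of atomic terms.

def aeq_normalize(summands):
    """summands: list of (coefficient, atom) pairs.
    Returns (k, {atom: k_i}) representing (1/k) * sum(k_i * t_i)."""
    coeffs = {}
    for c, a in summands:
        c = Fraction(c)
        coeffs[a] = coeffs.get(a, Fraction(0)) + c
    coeffs = {a: c for a, c in coeffs.items() if c != 0}   # drop cancelled atoms
    if not coeffs:
        return 1, {}
    # k = lcm of the denominators, so every coefficient times k is an integer
    k = reduce(lambda x, y: x * y // gcd(x, y),
               (c.denominator for c in coeffs.values()), 1)
    ints = {a: int(c * k) for a, c in coeffs.items()}
    # divide out the gcd of k and all numerators, keeping k positive
    g = reduce(gcd, (abs(v) for v in ints.values()), k)
    return k // g, {a: v // g for a, v in ints.items()}
```

For example, ½·*x* + ½·*x* + ⅓·*y* normalizes to $\frac{1}{3}(3x + y)$, and a fully cancelling sum normalizes to the empty combination.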

As discussed in Example 2, the (FM) rule of Alasca*<sup>θ</sup>* is similar to binary resolution, as it can be seen as "resolving" atomic subterms instead of literals. To formalize such handling of terms in (FM), we distinguish the so-called atoms(*t*), the atoms of a term *t*. Given an A<sub>eq</sub>-normalized term $t = \frac{1}{k}(\pm_1 k_1t_1 + \dots \pm_n k_nt_n)$, we define atoms<sup>±</sup>(*t*) = ⟨*k*, *k*<sub>1</sub> ∗ {±<sub>1</sub>*t*<sub>1</sub>} ∪ · · · ∪ *k*<sub>*n*</sub> ∗ {±<sub>*n*</sub>*t*<sub>*n*</sub>}⟩ and atoms(*t*) = ⟨*k*, *k*<sub>1</sub> ∗ {*t*<sub>1</sub>} ∪ · · · ∪ *k*<sub>*n*</sub> ∗ {*t*<sub>*n*</sub>}⟩, with multiset unions. We extend both of these functions *f* ∈ {atoms*,* atoms<sup>±</sup>} to literals as follows: *f*(*t* ⋄ 0) = *f*(*t*), assuming that the term *t* has been normalised beforehand such that *k* = 1. For (dis)equalities *s* ≈ *t* (*s* ̸≈ *t*) of uninterpreted sorts, we define atoms to be ⟨1*,* {*s, t*}⟩. Further, we define maxAtoms(*t*) to be the set of maximal terms in atoms(*t*) with respect to ≺, and maxAtom(*t*) = *t*<sub>0</sub> if maxAtoms(*t*) = {*t*<sub>0</sub>}.
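The atoms and maxAtoms functions can be illustrated over pairs (*k*, {*t*<sub>*i*</sub>: *k*<sub>*i*</sub>}) mirroring the A<sub>eq</sub>-normalized form defined above (an illustrative sketch; the ordering `prec` is a stand-in for the A<sub>eq</sub>-compatible ordering of the calculus and is an assumption of ours):

```python
# Illustrative atoms/maxAtoms: a term (k, {t_i: k_i}) yields the multiset of
# its atomic summands (signs dropped), and maxAtoms keeps the ones maximal
# in the given ordering.

def atoms(term):
    k, summands = term
    return {a: abs(ki) for a, ki in summands.items()}   # atom -> multiplicity

def max_atoms(term, prec):
    """prec(a, b) is True iff a ≺ b; returns the ≺-maximal atoms."""
    ats = list(atoms(term))
    return {a for a in ats if not any(prec(a, b) for b in ats if b != a)}
```

For instance, with an ordering that makes `'f(x)'` larger than `'x'`, the term $\frac{1}{1}(x + 2\,f(x))$ has maxAtoms {`'f(x)'`}.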

Alasca*<sup>θ</sup> Inferences.* The inference rules of Alasca*<sup>θ</sup>* are summarized in Fig. 3a. All rules are parametrized by an A<sub>eq</sub>-compatible ordering relation ≺ on ground terms, literals and clauses. Underlining a literal in a clause or an atomic term in a sum means that the underlined expression is non-strictly maximal wrt the other literals in the clause, or atomic terms in the sum. We use double underlining to denote that the expression is strictly maximal. We call **L**<sup>θ</sup><sub>+</sub> the set of potentially productive literals, defined as all equalities and inequalities whose strictly maximal atomic term has a positive coefficient.

Finding the right ordering relation is non-trivial, as many different requirements, like compatibility, the subterm property, well-foundedness, and stability under substitutions, need to be met [25, 26, 39, 41]. For Alasca, we use a modified version of the Qkbo ordering of [26], with the following two modifications.

(i) Firstly, the Alasca ordering is defined for non-ground terms. This means that the ordering needs to handle subterms with sums where there is no maximal atomic summand, like the term *x* + *y*. In addition, our ordering needs to be stable under substitutions in order to work with non-ground terms. Note however that our atom functions atoms and atoms<sup>±</sup> are not stable under substitutions, as the term *f*(*x*) − *f*(*y*) and the substitution {*x* ↦ *y*} demonstrate. Therefore, we parametrize our Alasca ordering by the relation subsSafe. The subsSafe relation fulfils the property that if subsSafe($\frac{1}{k}(\pm_1 k_1t_1 + \dots \pm_n k_nt_n)$), then there is no substitution *θ* such that ±<sub>*i*</sub>*k*<sub>*i*</sub>*t*<sub>*i*</sub>*θ* ≡ ∓<sub>*j*</sub>*k*<sub>*j*</sub>*t*<sub>*j*</sub>*θ*, for any *i, j*. In general, checking the existence of such a *θ* is as hard as unifying modulo A<sub>eq</sub>. Nevertheless, we can overapproximate the subsSafe relation using the canAbstract predicate.

#### **Fourier-Motzkin Elimination**

$$\frac{C_1 \lor js + t_1 \gtrsim_1 0 \qquad C_2 \lor -ks' + t_2 \gtrsim_2 0}{C_1 \lor C_2 \lor kt_1 + jt_2 > 0}\ (\mathsf{FM})$$

where **–** *js* + *t*<sub>1</sub> ≳<sub>1</sub> 0 ≻ *C*<sub>1</sub> **–** −*ks*′ + *t*<sub>2</sub> ≳<sub>2</sub> 0 ⪰ *C*<sub>2</sub> **–** *s* ≡ *s*′ **–** {>} ⊆ {≳<sub>1</sub>, ≳<sub>2</sub>} ⊆ {>, ≥}

#### **Inequality Factoring**

$$\frac{C \lor js + t_1 \gtrsim_1 0 \lor ks' + t_2 \gtrsim_2 0}{C \lor kt_1 - jt_2 \gtrsim_3 0 \lor ks' + t_2 \gtrsim_2 0}\ (\mathsf{IF})$$

where **–** *s* ≡ *s*′ **–** ∀*L* ∈ (*C* ∨ *js* + *t*<sub>1</sub> ≳<sub>1</sub> 0). *ks*′ + *t*<sub>2</sub> ≳<sub>2</sub> 0 ⪰ *L*, or ∀*L* ∈ (*C* ∨ *ks*′ + *t*<sub>2</sub> ≳<sub>2</sub> 0). *js* + *t*<sub>1</sub> ≳<sub>1</sub> 0 ⪰ *L* **–** ≳<sub>1</sub>, ≳<sub>2</sub> ∈ {>, ≥} **–** ≳<sub>3</sub> = ≥ if ≳<sub>1</sub> = ≥ and ≳<sub>2</sub> = >, and ≳<sub>3</sub> = > otherwise

**Contradiction**

$$\frac{C \lor \pm k \diamond 0}{C}\ (\mathsf{Triv})$$

where **–** ⋄ ∈ {>, ≥, ≈̂, ̸≈} **–** *k* ∈ ℚ **–** ℚ ̸⊨ ±*k* ⋄ 0

#### **Tight Fourier-Motzkin Elimination**

$$\frac{C_1 \lor js + t_1 \ge 0 \qquad C_2 \lor -ks' + t_2 \ge 0}{C_1 \lor C_2 \lor kt_1 + jt_2 > 0 \lor -ks' + t_2 \approx 0}\ (\mathsf{FM}_{\ge})$$

where **–** *js* + *t*<sub>1</sub> ≥ 0 ≻ *C*<sub>1</sub> **–** −*ks*′ + *t*<sub>2</sub> ≥ 0 ⪰ *C*<sub>2</sub> **–** *s* ≡ *s*′

#### **Term Factoring**

$$\frac{C \lor js + ks' + t \diamond 0}{C \lor (j+k)s' + t \diamond 0}\ (\mathsf{TF})$$

where **–** *s* ≡ *s*′ **–** ⋄ ∈ {*>,* ≥*,* ≈̂*,* ̸≈} **–** *s, s*′ ∈ maxAtoms(*C* ∨ *js* + *ks*′ + *t* ⋄ 0) **–** there is no uninterpreted literal in *C*

#### **Superposition**

$$\frac{C_1 \lor s \mathbin{\hat\approx} t \qquad C_2 \lor L[s']}{C_1 \lor C_2 \lor L[t]}\ (\mathsf{Sup})$$

where **–** *s* ≡ *s*′ **–** *s* ≈̂ *t* ≻ *C*<sub>1</sub> **–** *L*[*s*′] ∈ **L**<sup>θ</sup><sub>+</sub> and *L*[*s*′] ≻ *C*<sub>2</sub>, or *L*[*s*′] ∉ **L**<sup>θ</sup><sub>+</sub> and *L*[*s*′] ⪰ *C*<sub>2</sub> **–** *s*′ is a subterm of some *u* ∈ maxAtoms(*L*[*s*′]) **–** *s* ≈̂ *t* ∨ *C*<sub>1</sub> ≺ *C*<sub>2</sub> ∨ *L*[*s*′]

#### **Equality Resolution**

$$\frac{C \lor s \not\approx s'}{C}\ (\mathsf{ER})$$

where **–** *s* ≡ *s*′ **–** *s* ̸≈ *s*′ ⪰ *C*

#### **Equality Factoring**

$$\frac{C \lor s \mathbin{\hat\approx} t_1 \lor s' \mathbin{\hat\approx} t_2}{C \lor t_1 \not\approx t_2 \lor s \mathbin{\hat\approx} t_1}\ (\mathsf{EF})$$

where **–** *s* ≡ *s*′ **–** *s*′ ≈̂ *t*<sub>2</sub> ⪰ *C* ∨ *s* ≈̂ *t*<sub>1</sub>

#### (a) Rules of the ground calculus Alasca*<sup>θ</sup>* .

#### **Variable Elimination**

$$\frac{C \lor \bigvee_{i \in I} x + b_i \gtrsim_i 0 \lor \bigvee_{j \in J} -x + b_j \gtrsim_j 0 \lor \bigvee_{k \in K} x + b_k \approx 0 \lor \bigvee_{l \in L} x + b_l \not\approx 0}{\bigwedge_{K^+ \subseteq K} \left( \begin{array}{l} C \lor \bigvee_{i \in I, j \in J} b_i + b_j \gtrsim_{i,j} 0 \lor \bigvee_{i \in I, k \in K^-} b_i - b_k \ge 0 \lor \bigvee_{i \in I, l \in L} b_i - b_l \gtrsim_i 0 \\ {} \lor \bigvee_{j \in J, k \in K^+} b_j + b_k \ge 0 \lor \bigvee_{j \in J, l \in L} b_j + b_l \gtrsim_j 0 \lor \bigvee_{k_1 \in K^+, k_2 \in K^-} b_{k_1} - b_{k_2} \ge 0 \\ {} \lor \bigvee_{k \in K^+, l \in L} b_k - b_l \ge 0 \lor \bigvee_{k \in K^-, l \in L} b_l - b_k \ge 0 \lor \bigvee_{l_1, l_2 \in L} b_{l_1} - b_{l_2} \not\approx 0 \end{array} \right)}\ (\mathsf{VE})$$

where

**–** *x* is an unshielded variable **–** *C* does not contain *x* **–** *K*<sup>−</sup> = *K* \ *K*<sup>+</sup> **–** ≳<sub>*i*</sub>, ≳<sub>*j*</sub> ∈ {≥, >} **–** ≳<sub>*i,j*</sub> = ≥ if ≥ ∈ {≳<sub>*i*</sub>, ≳<sub>*j*</sub>}, and ≳<sub>*i,j*</sub> = > otherwise

#### (b) Variable elimination rule used for lifting Alasca*<sup>θ</sup>*.

**Fig. 3.** Inference rules used to define the calculus Alasca.

(ii) Secondly, we adjusted the Alasca ordering to be A<sub>eq</sub>-compatible, instead of **AC**-compatible. We modified the literal ordering of Alasca such that literals are ordered by all their atoms using the weighted multiset extension of ≺, instead of only using the maximal atom of each literal *L* as in [26].

We define a model functor *I*<sub>·</sub><sup>∞</sup> mapping sets of clauses to A<sub>Q</sub>-models (see [23] for details) and conclude the following.

**Theorem 4.** Alasca*<sup>θ</sup>* *is a counterexample-reducing inference system with respect to* *I*<sub>·</sub><sup>∞</sup> *and* ≺*.*

#### **5.2 ALASCA Lifting and Completeness**

*Variable Elimination.* Theorem 4 establishes completeness of Alasca*<sup>θ</sup>* for ground clauses wrt A<sub>Q</sub>. We next lift this result (and the calculus) to non-ground clauses.

We introduce the concept of an *unshielded variable*. We say a term *t* : *τ*<sub>Q</sub> is a top-level term of a literal *L* if *t* ∈ atoms(*L*). We call a variable *x unshielded* in some clause *C* if *x* is a top-level term of a literal in *C*, and there is no literal with an atomic top-level term *t*[*x*]. Observe that within the Alasca*<sup>θ</sup>* rules, only maximal atomic terms in sums are used in rule applications. This means that lifting Alasca*<sup>θ</sup>* to Alasca is straightforward for clauses where all maximal terms in sums are not variables. Further, due to the subterm property, if a variable is maximal in a sum then it must be unshielded. Hence, the only variables we have to deal with in Alasca rule applications are unshielded ones.

The work of [40] modifies a standard saturation algorithm by integrating it with a variable elimination rule that gets rid of unshielded variables, without compromising completeness of the calculus. Based on [40] and the variable elimination rule of [3], we extend Alasca*<sup>θ</sup>* with the Variable Elimination Rule (VE), given in Fig. 3b. In what follows, we show that the handling of unshielded variables in Fig. 3b can naturally be done within a standard saturation framework.

The (VE) rule replaces any clause with a set of clauses that is equivalent and does not contain unshielded variables. We assume that the clause is normalized such that in every inequality *x* only occurs once, with a factor 1 or −1, whereas for equalities, *x* only occurs with factor 1. A simple example for the application of (VE) is the clause *a* − *x >* 0 ∨ *x* − *b >* 0 ∨ *a* + *b* − *x* ≥ 0, where *x* ∈ **V**, and *a, b* are constants. By reasoning about inequalities, it is easy to see that this clause is equivalent to *a > x* ∨ *a* + *b* ≥ *x* ∨ *x > b*, and thus further equivalent to *a > b* ∨ *a* + *b* ≥ *b*, which illustrates the benefit of variable elimination through (VE).
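The inequality-only core of this elimination can be sketched as follows (our own simplification for illustration: each *b* value is a number standing in for the term *b*<sub>*i*</sub>, and only the *I*/*J* bound combinations of the rule are shown). Every lower bound on *x* is combined with every upper bound, and the resolvent is non-strict as soon as one premise is non-strict, matching the ≳<sub>*i,j*</sub> side condition:

```python
# Illustrative core of (VE) for the inequality-only case: a clause fragment
#   \/_i (x + b_i ~_i 0)  \/  \/_j (-x + b_j ~_j 0)
# is replaced by all resolvents b_i + b_j ~_{i,j} 0, eliminating x.
# Bounds are (b, strict) pairs.

def ve_inequalities(lower, upper):
    """lower: literals x + b_i ~ 0 (lower bounds on x);
    upper: literals -x + b_j ~ 0 (upper bounds on x)."""
    out = []
    for b_i, strict_i in lower:
        for b_j, strict_j in upper:
            # the combined comparison is strict only if both premises are
            out.append((b_i + b_j, strict_i and strict_j))
    return out
```

With the example clause above instantiated at, say, *a* = 5 and *b* = 3, the lower bound *x* − *b* > 0 against the upper bounds *a* − *x* > 0 and *a* + *b* − *x* ≥ 0 produces *a* − *b* > 0 and *a* ≥ 0, mirroring *a > b* ∨ *a* + *b* ≥ *b*.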

#### **Lemma 2.** *The conclusion of* (VE) *is equivalent to its premise.*

Alasca *Calculus – Non-Ground Version with Unification with Abstraction.* We now define our lifted calculus Alasca as follows. Let Alasca<sup>−</sup> be the lifting of the calculus Alasca*<sup>θ</sup>* for clauses without unshielded variables. We define Alasca to be Alasca<sup>−</sup> chained with the variable elimination rule. That is, the result of every rule application is simplified using (VE) for as long as applicable.

**Theorem 5.** Alasca *is the lifting of a counterexample-reducing inference system for sets of clauses without unshielded variables.*

Theorem 5 implies that Alasca is refutationally complete wrt A<sub>Q</sub> for sets of clauses without unshielded variables. As (VE) can be used to preprocess arbitrary sets of clauses to eliminate all unshielded variables, we obtain the following.

**Corollary 1.** *If N is a set of clauses that is unsatisfiable with respect to* A<sub>Q</sub>*, then N can be refuted using* Alasca*.*

We conclude this section by specifying the lifting of Alasca*<sup>θ</sup>* to obtain Alasca<sup>−</sup>. To this end, we use our uwa results and properties for unification with abstraction (Sect. 4). We note that using unification modulo A<sub>eq</sub> would require us to develop an algorithmic approach that computes a complete set of unifiers modulo A<sub>eq</sub>, which is a quite challenging task both in theory and in practice. Instead, using Theorem 1 and Theorem 3, we only need to specify a canAbstract predicate that guards interpreted functions and captures A<sub>eq</sub> within uwa. This is achieved by defining canAbstract(*s, t*) to hold if any function symbol *f* ∈ {sym(*s*)*,* sym(*t*)} is an interpreted function *f* ∈ ℚ ∪ {+}. This choice of the canAbstract predicate is a slight modification of the abstraction strategy one\_side\_interpreted of [34]. We note that this is not the only choice of predicate fulfilling the canAbstract properties. Consider for example the terms *f*(*x*) + *a* and *a* + *b*. There is no substitution that makes these two terms equal, but our abstraction predicate introduces a constraint upon trying to unify them. In order to address this, we introduce an alternative canAbstract predicate that compares the atoms of a term, instead of only looking at the outermost symbol (Sect. 6).
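The described choice of canAbstract can be sketched as follows (an illustrative approximation, not the Vampire implementation: terms are encoded as nested tuples or strings, numerals as digit strings representing multiplication by a rational constant, and only `+` is modelled among the interpreted symbols):

```python
# Sketch of the canAbstract predicate: abstract whenever the top symbol of
# either term is interpreted (here a numeral or '+'), in the spirit of a
# one_side_interpreted-style strategy.

INTERPRETED = {'+'}

def top_symbol(t):
    # tuples are compound terms (f, args...); strings are atoms/numerals
    return t[0] if isinstance(t, tuple) else t

def can_abstract(s, t):
    def interpreted(u):
        f = top_symbol(u)
        return f in INTERPRETED or f.lstrip('-').isdigit()
    return interpreted(s) or interpreted(t)
```

As in the discussion above, this predicate fires on *f*(*x*) + *a* versus *a* + *b* (both have `+` on top) even though the pair has no unifier, which is exactly the imprecision the alternative atom-comparing predicate of Sect. 6 is meant to avoid.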

We believe more precise abstraction predicates can further improve proof search, as evidenced by our experiments using this second abstraction predicate (Sect. 6).

## **6 Implementation and Experiments**

We implemented Alasca<sup>5</sup> as an extension of the Vampire theorem prover [27].

*Benchmarks.* We evaluated the practicality of Alasca using the following six sets of benchmarks, resulting altogether in 6374 examples, as listed in Table 1 and detailed next. (i) We considered all sets of benchmarks from the SMT-LIB repository [7] that involve real arithmetic and uninterpreted functions, but no other theories. These are the three benchmark sets corresponding to the LRA, NRA, and UFLRA logics in SMT-LIB. (ii) We further used Sledgehammer examples generated by [15], using the SMT-LIB syntax. From the examples of [15], we selected those benchmarks that involve real arithmetic but no other theories. We refer to this benchmark set as SH. (iii) Finally, we also created two new sets of benchmarks, Triangular and Limit, exploiting various mathematical properties. The Triangular suite contains variations of our motivating example from Sect. 2, and thus comes with reasoning challenges about triangular inequalities

<sup>5</sup> available at https://github.com/vprover/vampire/tree/alasca


**Table 1.** Experimental results, showing the numbers of solved problems. Columns: Benchmarks (#), Alasca, Cvc5, Vampire, Yices, UltElim, SmtInt, veriT solved.

and continuous functions. The Limit benchmark set comprises problems that combine various limit properties of real-valued functions.

*Experimental Setup.* We compared our implementation against the solvers from the Arith (arithmetic) division of the SMT-COMP 2022 competition. These solvers, given in columns 3–8 of Table 1, are: Cvc5 [5], Vampire [35], Yices [19], UltElim [8], SmtInt [21], and veriT [2]. We note that Vampire is run in its competition portfolio mode, which includes the work from [34]. Alasca uses the same portfolio but implements our modified version of unification with abstraction (Sect. 4), disabling the use of theory axioms and relying instead on our new Alasca rules (Sect. 5). We ran our experiments using the SMT-COMP 2022 competition setup: the StarExec Iowa cluster, with a 20-minute timeout and 4 cores. Benchmarks, solvers and results are publicly available<sup>6</sup>.

*Experimental Results.* Table 1 summarizes our experimental findings and indicates the overall best performance of Alasca. For example, Alasca outperforms the two best arithmetic solvers of SMT-COMP 2022, solving 118 more problems than Cvc5 and 159 more problems than Vampire.

## **7 Conclusions and Future Work**

We introduced the Alasca calculus, drastically improving the performance of superposition theorem proving on linear arithmetic. Alasca eliminates the use of theory axioms by introducing theory-specific rules, such as an analogue of Fourier-Motzkin elimination. We perform unification with abstraction with a general theoretical foundation, which, together with our variable elimination rules, serves as a replacement for unification modulo theory. Our experiments show that Alasca is competitive with state-of-the-art theorem provers, solving more problems than any prover that entered the arithmetic division of SMT-COMP 2022. Future work includes designing an integer version of Alasca, developing different versions of the canAbstract predicate, and improving literal/clause selection within Alasca.

**Acknowledgements.** This work was partially supported by the ERC Consolidator Grant ARTIST 101002685, the TU Wien Doctoral College SecInt, the FWF SFB project SpyCoDe F8504, and the EPSRC grant EP/V000497/1.

<sup>6</sup> https://www.starexec.org/starexec/secure/explore/spaces.jsp?id=535817

## **References**


Schoisswohl, J.: Vampire 4.7-SMT System Description. https://smt-comp.github.io/2022/system-descriptions/Vampire.pdf (2022)


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## A Matrix-Based Approach to Parity Games

Saksham Aggarwal, Alejandro Stuckey de la Banda, Luke Yang, and Julian Gutierrez

Monash University, Faculty of Information Technology, Melbourne, Australia {sagg0005,astu0006,lyan0042}@student.monash.edu julian.gutierrez@monash.edu

Abstract. Parity games are two-player zero-sum games of infinite duration played on finite graphs, for which no polynomial-time solution is known. Solving a parity game is an NP ∩ co-NP problem, with the best worst-case complexity algorithms available in the literature running in quasi-polynomial time. Given the importance of parity games within automated formal verification, several practical solutions have been explored, showing that considerably large parity games can be solved somewhat efficiently. Here, we propose a new approach to solving parity games guided by the efficient manipulation of a suitable matrix-based representation of the games. Our results show that a sequential implementation of our approach offers very competitive performance, while a parallel implementation using GPUs outperforms the current state-of-the-art techniques. Our study considers both real-world benchmarks of structured games and randomly generated parity games. We also show that our matrix-based approach retains the optimal complexity bounds of the best recursive algorithm to solve large parity games in practice.

Keywords: Parity games · Formal verification · Parallel computing.

## 1 Introduction

Parity games are one of the most useful and effective algorithmic tools used in automated formal verification [18,5,2]. Indeed, several computational problems, such as model checking and automated synthesis from temporal logic specifications, can be reduced to the solution of a parity game [5,2]. More formally, a parity game is a two-player zero-sum game of infinite duration played on a finite graph. Since these games are determined [14,8], solving them is equivalent to finding a winning strategy for one of the two players in the game; or, similarly, deciding from which vertices in the graph one of the two players can force a win no matter which strategy the other player uses. The main question regarding parity games is the computational complexity of finding a solution of the game, a problem that is known to be in NP ∩ co-NP [11]. However, despite decades of research, a polynomial-time algorithm to solve such games remains elusive. The best-known decision procedures to solve parity games, most of them recently developed [4,13], run in quasi-polynomial time, which provides better worst-case complexity upper bounds than the previous exponential-time approaches [18] found in the parity games literature.


The importance of parity games in the solution of real-life automated verification problems, and the lack of a polynomial-time decision procedure to solve such games, have motivated the development and implementation of algorithms that can solve parity games somewhat efficiently in practice, despite their known worst-case exponential time complexity. In the quest for such decision procedures, several different approaches have been investigated over the last two decades, ranging from solutions that try to improve or optimise the choice of high-level algorithm used to reason about parity games, to the programming language used to implement such a solution, the concrete data structures used to represent the games, or the type of hardware architecture used for deployment [7,6,17,9].

Progress in solving parity games in practice has been made in different directions. In [7], a state-of-the-art implementation of the best-known algorithms for solving parity games was presented. In that work, two algorithms were found to deliver the best performance in practice, namely, Zielonka's recursive algorithm (ZRA [18]) and priority promotion [3], with the former showing slightly better performance when solving random games and a selection of structured games for model checking, and the latter outperforming ZRA when solving a selection of structured games for equivalence checking. Overall, however, the two algorithms exhibit extremely similar performance in practice, including that of a parallel implementation of ZRA. Another attempt to improve the performance of solving parity games is presented in [6]. In that work, better performance is sought through a parallel implementation of ZRA, which is known to consistently deliver the best performance across different platforms and for different types of games.

These two works [7,6] reach two strikingly opposing conclusions. While in [7] the parallel implementation of ZRA is outperformed even by the best sequential implementation of the same algorithm, in [6] significant gains in performance are observed when parallelising the computation of ZRA – which may solve a large set of random parity games between 3.5 and 4 times faster than the sequential implementation of the same algorithm. These two results, both arguably representing the state of the art in solving parity games in practice, indicate that no definitive conclusion can be drawn about what the best approach to solving parity games in practice is, let alone whether a parallel implementation would necessarily produce better results than its sequential version. In this paper, we present a new approach to solving parity games, and investigate some of the issues exposed by the two papers above.

More specifically, motivated by the need to find effective new techniques for solving parity games, in particular in large practical settings, in this paper we:


Our matrix-based approach, whose parallel implementation outperforms the state-of-the-art solvers for parity games, consists in reducing key operations on parity games to simple computations on large matrices, which can be significantly accelerated in practice using sophisticated techniques for matrix manipulation, specifically, modern GPU technologies. Firstly, our matrix-based approach partly builds on the observation that most of the computation time when using ZRA is spent running a particular subroutine, the "attractor" function, which we can parallelise. Secondly, we also rely on the observation that computations on matrices – which guide the search for the solution of parity games within our approach – can be efficiently parallelised using a combination of algorithmic techniques for parallel computation and GPU devices.

## 2 Preliminaries

A parity game is a two-player zero-sum infinite-duration game played over a finite directed graph G = (V0, V1, E, Ω), where V = V0 ∪ V1 is a set of vertices/nodes partitioned into vertices V0 controlled by Player Even/0 and vertices V1 controlled by Player Odd/1. Whenever a statement about both players is made, we may use the letter q (∈ {0, 1}) to refer to either player, and 1 − q to refer to the other player in the game. Without any loss of generality, we also assume that every vertex in the graph has at least one successor. Moreover, the function Ω : V → N is a labelling function on the set of vertices of the graph which assigns each vertex a priority. Intuitively, a parity game is played by moving a token along the graph (starting from some designated node in V), with the owner of the node the token is on selecting a successor node in the graph. Because every vertex has a successor, this process continues indefinitely, producing an infinite sequence of visited nodes and, consequently, an infinite sequence of seen priorities. The winner of a particular play is determined by the highest priority that occurs infinitely often: Player 0 wins if the highest infinitely recurring priority is even, while Player 1 wins if it is odd. Parity games are determined, which means that it is always the case that one of the two players has a strategy (called a winning strategy) that wins against all possible strategies of the other player. Solving a parity game amounts to deciding, for every node in the game, which player has a winning strategy for the game starting in that node; that is, computing disjoint sets W0 ⊆ V and W1 ⊆ V such that Player q has a winning strategy for every play in the game that starts from a node in Wq, with q ∈ {0, 1}.
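Since the winner of a play depends only on the priorities that occur infinitely often – i.e., those on the cycle the play eventually settles into under positional strategies – the winning condition itself can be sketched in a few lines. The following Python helper is our own illustration (name and representation are not from the paper):

```python
def play_winner(cycle_priorities):
    """Winner of an infinite play that eventually repeats the given cycle
    forever: the priorities seen infinitely often are exactly those on the
    cycle, so the winner is decided by the parity of their maximum."""
    m = max(cycle_priorities)
    return 0 if m % 2 == 0 else 1  # Player 0 wins iff the maximum is even
```

For example, a play cycling through priorities (1, 2, 3) is won by Player 1, since the highest infinitely recurring priority, 3, is odd.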

Somewhat surprisingly, the best performing algorithm to solve parity games in practice is Zielonka's Recursive Algorithm (ZRA [18]), which runs in time exponential in the number of priorities, bounded by |V|. This algorithm is rather simple, and mostly relies on the computation of attractor sets, which are sets of vertices A = Attr_q(X), inductively defined for each Player q as shown below, and used to compute both W0 and W1 recursively. Formally, the attractor function Attr_q : P(V) → P(V) for Player q computes the attractor set of a given set of vertices U ⊆ V, and is defined inductively as follows:

```
Algorithm 1 Zielonka(G)
```

```
if V = ∅ then
   (W0, W1) ← (∅, ∅)
else
   m ← max{Ω(v) | v ∈ V}
   q ← m mod 2
   U ← {v ∈ V | Ω(v) = m}
   A ← Attrq(U)
   (W′0, W′1) ← Zielonka(G \ A)
   if W′1−q = ∅ then
      (Wq, W1−q) ← (A ∪ W′q, ∅)
   else
      B ← Attr1−q(W′1−q)
      (W′0, W′1) ← Zielonka(G \ B)
      (Wq, W1−q) ← (W′q, W′1−q ∪ B)
   end if
end if
return (W0, W1)
```

$$\begin{aligned} Attr\_q^0(U) &= U\\Attr\_q^{n+1}(U) &= Attr\_q^n(U) \\ &\cup \{ u \in V\_q \mid \exists v \in Attr\_q^n(U) : (u,v) \in E \} \\ &\cup \{ u \in V\_{1-q} \mid \forall v \in V : (u,v) \in E \Rightarrow v \in Attr\_q^n(U) \} \\ Attr\_q(U) &= Attr\_q^{|V|}(U) \end{aligned}$$

As shown in Algorithm 1, ZRA [18] finds disjoint sets of vertices W0/W1 from which Player 0/1 has a winning strategy. Through the computation of attractor sets, the algorithm works by recursively decomposing the graph, finding sets of nodes that can be forced towards the highest-priority node(s), and hence building the winning regions W0 and W1 for each player in the game.
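The fixpoint definition of Attr_q above can be implemented directly over explicit sets. The following Python sketch is our own naive illustration (not the paper's implementation); a q-node joins the set if some successor is already attracted, a node of the other player joins if all its successors are:

```python
def attractor(nodes, owned_by_q, edges, U):
    """Attr_q(U): least set A containing U such that Player q can force
    the token into A from every node of A."""
    succ = {v: [] for v in nodes}
    for (u, w) in edges:
        succ[u].append(w)
    A = set(U)
    changed = True
    while changed:                      # iterate to the least fixpoint
        changed = False
        for u in nodes:
            if u in A:
                continue
            if u in owned_by_q:
                join = any(w in A for w in succ[u])   # q can move into A
            else:
                join = all(w in A for w in succ[u])   # opponent is forced into A
            if join:
                A.add(u)
                changed = True
    return A
```

For instance, with edges {(1,2), (2,2), (3,1), (3,3)}, node 1 owned by Player q, and target U = {2}, the attractor is {1, 2}: node 3, owned by the opponent, can escape via its self-loop.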

## 3 A matrix-based approach

Experimental results from [7] motivated us to investigate whether ZRA can be improved in practice, since this algorithm shows the best performance both on random games and on several structured games found in practical settings. This finding is complemented by the observation made in [6] that, when running ZRA, most of the time is spent in the computation of attractor sets: about 99% as reported in [6] (with experiments considering random games only), and about 77% in our study (which considers larger classes of games).

Our observation, and working hypothesis, not found in previous work [7,6], is that the basic ZRA can be highly optimised in practice if its main computation component – the attractor set subroutine – is accelerated using efficient

Algorithm 2 Attr(A, t, q, g, o)

```
d ← Ag
t′ ← 0
while ‖t − t′‖₁ ≠ 0 do
   t′ ← t
   v ← At
   t ← g ⊙ ((o = q) ⊙ (v > 0) + (o = (1 − q)) ⊙ (v = d))
end while
return t
```
techniques for matrix manipulation, provided the attractor set procedure is represented via computations/operations on matrices encoding the attractor set subroutine in ZRA. This is precisely what we do in this section, which in turn makes our approach particularly well suited to a parallel implementation using modern GPU technologies for efficient matrix manipulation.

To achieve a matrix-based encoding of ZRA, and in particular of its attractor set subroutine, we redefine the representation of the graph in terms of a sparse adjacency matrix A, a vector o defining the ownership of every node, and a vector ω defining the priority of every node. Due to the potentially high computational cost of copying A, we maintain a vector g representing which nodes are still included in the game (the subgame being computed at that point in the algorithm), which is copied and updated as Zielonka's algorithm recurses and decomposes the graph into ever smaller parts. With this, we can find d = Ag, a vector containing the out-degree of every node within the current subgame. More specifically:

– (A)ij = 1, if (vi, vj) ∈ E; (A)ij = 0, otherwise;

– (o)i = q, if node i belongs to Player q, with q ∈ {0, 1};

– (ω)i = Ω(vi);

– (g)i = 1, if node i is in the game; (g)i = 0, otherwise.

With these definitions in place, we can make the necessary modifications to the attractor function presented before – see Algorithm 2. The input/output vector t contains 1 at position (t)i if node i is part of the attractor set, and 0 otherwise. We use vectorised operations: if a vector is compared to another vector, the comparison is done element-wise; if a vector is compared to a scalar, the scalar s is implicitly converted, s = s1. The ⊙ operator denotes the Hadamard product, which is used primarily as a Boolean AND operation. The argument q is the player: 0 for Player 0 and 1 for Player 1.

This algorithm works by first finding the number of outbound edges each node has (d ← Ag), and at each iteration finding in how many ways each node can enter the attractor set (v ← At). It then finds nodes owned by q that may enter the attractor set ((o = q) ⊙ (v > 0)), and nodes not owned by q that are forced to enter the attractor set ((o = (1 − q)) ⊙ (v = d)). It then filters the nodes to include in the attractor set depending on which nodes are still included in the subgraph (g ⊙ (· · ·)), and breaks the loop when there is no difference between t and t′. To illustrate this procedure, take as an example the graph below.

```
Algorithm 3 MatZielonka(A, g, o)
```

```
if ‖g‖₁ = 0 then
   (W0, W1) ← (0, 0)
else
   m ← max(g ⊙ ω)
   q ← m mod 2
   t ← (ω = m)
   t ← Attr(A, t, q, g, o)
   (W′0, W′1) ← MatZielonka(A, g − t, o)
   if ‖W′1−q‖₁ = 0 then
      (Wq, W1−q) ← (t + W′q, 0)
   else
      t ← Attr(A, W′1−q, 1 − q, g, o)
      (W′0, W′1) ← MatZielonka(A, g − t, o)
      (Wq, W1−q) ← (W′q, W′1−q + t)
   end if
end if
return (W0, W1)
```
For this example, assume that g = 1 and that we are computing the attractor set for the player that owns the circle nodes, starting from the node with priority 7. After one (or some arbitrary number of) iteration(s), the current state is reached. Green nodes denote nodes included in the previous iteration's attractor set, and yellow nodes denote nodes that will be included in this iteration. The calculations performed are as follows. Define the adjacency matrix of the graph (A), the nodes currently included in the attractor set, t = (1 1 0 0 0), the ownership of every node, o = (0 0 1 1 0), and the degree – number of outbound edges – of every node, d = Ag = (1 1 2 2 1). Now, compute the number of edges from each node leading to an element of the current attractor set, that is, v = At = (1 1 2 1 1), and with that, update t to obtain t ← (1 1 1 0 1), which exactly represents the value of the attractor function one step later. Similar changes must also be made to ZRA itself in terms of the representation of the game, so that it becomes, fully, a matrix manipulation algorithm (Algorithm 3).
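The worked example above can be reproduced with a direct CPU-side transcription of Algorithm 2 in Python. In the sketch below, the adjacency matrix A is our own guess of a graph consistent with the stated vectors d and v, and, following the (monotone) inductive definition of the attractor, nodes already in t are kept in the set:

```python
def mat_attr(A, t, q, g, o, steps=None):
    """Matrix-based attractor (a sketch of Algorithm 2).
    A[i][j] = 1 iff there is an edge i -> j; t is the 0/1 attractor vector;
    q the player; g the 0/1 subgame-membership vector; o the ownership vector.
    `steps` optionally caps the number of iterations (None = run to fixpoint)."""
    n = len(A)
    matvec = lambda M, x: [sum(M[i][j] * x[j] for j in range(n)) for i in range(n)]
    d = matvec(A, g)                 # out-degree of every node in the subgame
    done = 0
    while steps is None or done < steps:
        v = matvec(A, t)             # edges from each node into the current set
        new_t = [max(t[i],           # keep nodes already attracted (monotonicity)
                     g[i] * int((o[i] == q and v[i] > 0) or
                                (o[i] == 1 - q and v[i] == d[i])))
                 for i in range(n)]
        if new_t == t:
            break
        t, done = new_t, done + 1
    return t

# Hypothetical adjacency matrix consistent with d = Ag = (1 1 2 2 1)
# and v = At = (1 1 2 1 1) for t = (1 1 0 0 0), as in the example.
A = [[0, 1, 0, 0, 0],
     [1, 0, 0, 0, 0],
     [1, 1, 0, 0, 0],
     [1, 0, 0, 0, 1],
     [1, 0, 0, 0, 0]]
g, o = [1] * 5, [0, 0, 1, 1, 0]
one_step = mat_attr(A, [1, 1, 0, 0, 0], 0, g, o, steps=1)  # -> [1, 1, 1, 0, 1]
```

A single step reproduces t ← (1 1 1 0 1) from the example; running to the fixpoint then also pulls in the remaining node.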

The correctness of the algorithm follows from that of ZRA, since our encoding into matrix operations is functional. Less clear is whether our algorithm retains ZRA's complexity, since using a functional mapping does not necessarily imply that the encoding (our representation) has the complexity of the encoded instance (i.e., the original problem). We study this question next.
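To make the encoding concrete, here is a compact, purely sequential Python transcription of MatZielonka over the vector representation. This is our own sketch for illustration (vectors are plain lists, ‖·‖₁ is a plain sum, and the initial target vector is restricted to the current subgame), not the paper's C++/CUDA implementation:

```python
def solve(A, omega, o):
    """Solve a parity game given adjacency matrix A (A[i][j] = 1 iff edge
    i -> j), priority vector omega and ownership vector o.  Returns the 0/1
    winning-region vectors (W0, W1), following Algorithm 3."""
    n = len(A)

    def matvec(x):
        return [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]

    def attr(t, q, g):
        d = matvec(g)                              # subgame out-degrees
        while True:
            v = matvec(t)                          # edges into the current set
            new_t = [max(t[i], g[i] * int((o[i] == q and v[i] > 0) or
                                          (o[i] == 1 - q and v[i] == d[i])))
                     for i in range(n)]
            if new_t == t:
                return t
            t = new_t

    def zielonka(g):
        if sum(g) == 0:                            # ||g||_1 = 0: empty subgame
            return [0] * n, [0] * n
        m = max(omega[i] for i in range(n) if g[i])
        q = m % 2
        t = [int(g[i] and omega[i] == m) for i in range(n)]
        t = attr(t, q, g)
        W = list(zielonka([g[i] - t[i] for i in range(n)]))
        if sum(W[1 - q]) == 0:
            W[q] = [t[i] + W[q][i] for i in range(n)]
            W[1 - q] = [0] * n
        else:
            t = attr(W[1 - q], 1 - q, g)
            W2 = list(zielonka([g[i] - t[i] for i in range(n)]))
            W[q] = W2[q]
            W[1 - q] = [W2[1 - q][i] + t[i] for i in range(n)]
        return W[0], W[1]

    return zielonka([1] * n)

# Two self-loop nodes: node 0 (priority 0, Player 0), node 1 (priority 1, Player 1)
A = [[1, 0], [0, 1]]
W0, W1 = solve(A, [0, 1], [0, 1])   # each player wins exactly their own node
```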

#### 3.1 Complexity

Using the algorithms defined before, we derive a function R(d, n) that bounds the maximum number of recursive calls to ZRA, given d distinct priorities and n nodes: R(d, n) = 1 + R(d − 1, n − 1) + R(d, n − 1). The 1 is the original call; the first recursive call is made with at least the vertex with the largest priority removed, and the second with at least one vertex removed. Hence the construction above. There are two base cases: R(d, 0) = R(0, n) = 1. Firstly, we observe that, based on the algorithms herein defined, we get:

$$\begin{aligned} R(d,n) &= 1 + R(d-1,n-1) + R(d,n-1) \\ &= (n+1) + \sum\_{i=1}^{n} [R(d-1,n-i)] \end{aligned}$$

Moreover, R(d, n) is then given by:

$$f(d,n) = 2\sum\_{j=0}^{d} \binom{n}{j} - 1$$

For the base case, when d = 1, we note that R(1, n) = (n + 1) + R(0, n − 1) + · · · + R(0, 0) = (n + 1) + n = 2n + 1 and f(1, n) = 2(1 + n) − 1 = 2n + 1 = R(1, n), as required, for all n. For the inductive case, assume that R(d, n) = f(d, n), for d = k and all n.

$$\begin{aligned} R(k+1,n) &= (n+1) + \sum\_{i=1}^{n} [R(k,n-i)] \\ &= (n+1) + \sum\_{i=1}^{n} [f(k,n-i)] \\ &= 1 + 2\sum\_{i=1}^{n} \sum\_{j=0}^{k} \binom{n-i}{j} = 2\sum\_{j=0}^{k+1} \binom{n}{j} - 1 = f(k+1,n) \end{aligned}$$

Hence, the statement is true for the base case d = 1 and all n, and the case d = k implies the case d = k + 1. Thus, by induction, R(d, n) = f(d, n) for d ≥ 1 and all n. We now observe that the worst-case number of calls occurs, as expected, at d = n, where R(n, n) = 2<sup>n+1</sup> − 1. Note that a single call to MatZielonka has time complexity O(n<sup>3</sup>) (dominated by the complexity of calls to the matrix-based Attr subroutine<sup>1</sup>) and space complexity O(n), delivering worst-case complexities of O(n<sup>3</sup> · 2<sup>n</sup>) time and O(n · 2<sup>n</sup>) space.
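The recurrence and its closed form can also be checked mechanically; a small Python sketch (using the names R and f as in the text):

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def R(d, n):
    # Bound on recursive calls; base cases R(d, 0) = R(0, n) = 1
    if d == 0 or n == 0:
        return 1
    return 1 + R(d - 1, n - 1) + R(d, n - 1)

def f(d, n):
    # Closed form: 2 * sum_{j=0}^{d} C(n, j) - 1
    return 2 * sum(comb(n, j) for j in range(d + 1)) - 1

# R and f agree, and the worst case d = n gives 2^(n+1) - 1 calls
assert all(R(d, n) == f(d, n) for d in range(1, 7) for n in range(9))
assert R(6, 6) == 2 ** 7 - 1
```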

This result, negative in theory, is consistent with the worst-case complexity of ZRA, which indicates that our matrix-based encoding retains the complexity properties of the original algorithm. More interestingly, the quasi-polynomial extension of ZRA by Parys [16], later improved by Lehtinen et al. [13], can also be tackled with our approach while retaining its quasi-polynomial complexity. However, a matrix-based extension of the latter algorithm was not evaluated; thus, its practical usefulness is yet to be studied.

<sup>1</sup> In practice, this is dominated by the complexity of performing matrix multiplication, which is only slightly larger than O(n<sup>2</sup>) and has recently become a vibrant topic of research due to improvements made through the use of deep learning.

## 4 Implementation and evaluation

Several factors influence the practical performance of a computational solution to a problem: for instance, (1) the algorithm used to solve the problem, (2) the programming language used to implement the solution, (3) the concrete data structures used to represent it, and (4) the hardware on which the solution is deployed. Our solution tries to optimise (1)–(4) using both lessons learnt from previous research and properties of our own matrix-based approach. Details are given later but, in short, in this section five parity game solvers are implemented and evaluated<sup>2</sup>:


Apart from (2), the five implementations above (I1–I5) allow us to carry out a comprehensive evaluation of our approach, both against different versions of our own work and against previous research. The only aspect that all the solutions presented in this section have in common is the programming language used for implementation, C++, at present the language offering the most efficient practical implementations of parity game solvers; cf. [9,17,6,7]. We first present the characteristics of our matrix-based approach, deployed both as a sequential algorithm and as a parallelised procedure. After that, we describe key features of the solutions originally developed elsewhere, and continue with the results of the evaluation using different types of parity games.

Matrix-based approach.<sup>3</sup> Whilst it is important to gain performance from parallelisable operations, it is equally important to avoid losing performance on inefficient or slow operations. Specific algorithmic design choices, such as maintaining a vector g to track which nodes are in or out of the graph, are made to avoid otherwise necessary operations such as copying the adjacency matrix, which would be slow, especially when solving very large games.

Additionally, all values in vectors and matrices are stored in practice as single-precision floating-point values. This is due to software limitations of the Compute Unified Device Architecture (CUDA) [15] library, which likely reflect limitations of the underlying hardware itself. In particular, this limits the maximum out-degree of a node to 2<sup>24</sup>, which corresponds to the number of bits in the mantissa of a single-precision floating-point number (23), plus one. Beyond this limit, the accuracy of values computed in operations such as the out-degree computation Ag would no longer be guaranteed, and neither would the correctness of the algorithm. We note that this limitation may be overcome by splitting a single node into multiple nodes, thus curbing the maximum out-degree to an acceptable range. We do not do this in these experiments, as this transformation has unknown impacts on the performance of the algorithm.
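The 2<sup>24</sup> boundary is a generic property of IEEE 754 single precision and can be observed without any GPU. A minimal Python illustration of our own, round-tripping values through a 32-bit float with the standard `struct` module:

```python
import struct

def to_float32(x):
    """Round-trip a number through IEEE 754 single precision ('f' format)."""
    return struct.unpack('f', struct.pack('f', x))[0]

# All integers up to 2^24 are exactly representable with a 23-bit mantissa
# (plus the implicit leading bit); 2^24 + 1 is the first integer that is not.
exact = to_float32(2 ** 24) == 2 ** 24          # representable
rounded = to_float32(2 ** 24 + 1) == 2 ** 24    # 2^24 + 1 rounds back down
```

Any count (such as an out-degree) exceeding this bound could therefore be silently corrupted when stored in single precision.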

<sup>2</sup> All files (implementations, experiments, input games, etc.) can be found in [1].

<sup>3</sup> The description here applies to the first two solutions described above.


#### Algorithm 4 Attr(A, t, q, g, o)

The invocation of functions that run on the GPU (known as kernels) have an overhead, with the overhead duration varying somewhat between devices. As a consequence, tuning for a particular problem depends on the functions being executed and the GPUs themselves. Thus, there are periods where the device is idle, and this is a result of the overheads. Also note that in practice, it is usually faster to perform multiple iterations of the attractor computation as performing an iteration when the full attractor set has already been computed does not alter the results (Algorithm 4). This is because queueing multiple kernel invocations has the same overhead as calling one kernel alone. The main difference between our sequential and parallel implementations of the matrix-based method is the function computing attractor sets, which is as in Algorithm 2 in the sequential case, and as in Algorithm 4 in the parallel case. The code in . . . is the same in both implementations, and the key difference is that we set the execution of the parallel implementation to make 3 kernel invocations per execution of the attractor function – which in lucky cases may require only 1 kernel invocation, while in unlucky cases may require more than 3 kernel invocations, increasing overheads; for our problem, we found that 3 kernel invocations was appropriate.

We find that there is another possible point of optimisation, as the time taken for the attractor computation is approximately c·t_c + n·t_o, where c is the number of attractor computations (the inside section of the for loop), n is the number of times the outer while loop runs, t_c is the time to run the for loop once, and t_o is the overhead incurred by switching execution from device (GPU) to host (CPU) when the condition of the while loop is checked. Ideally, c = C + 1 and n = 1, where C is the (unknown) number of attractor computations required. Our implementation loops the inner for loop a constant number of times (3 here). As such, C + 1 ≤ c ≤ C + 3 and n = ⌈C/3⌉.
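The batching idea – run a fixed number of update iterations between convergence checks, exploiting the fact that extra applications past the fixpoint are harmless – can be sketched generically. This is a hypothetical helper of our own (`batched_fixpoint` and `step` are our names), not the paper's CUDA code:

```python
def batched_fixpoint(step, x, batch=3):
    """Iterate `step` to a fixpoint, but check convergence only once per
    `batch` applications (mirroring one host/device sync per 3 queued kernels).
    Requires `step` to be idempotent at the fixpoint, as attractor updates are."""
    while True:
        prev = x
        for _ in range(batch):   # the "3 kernel invocations"
            x = step(x)
        if x == prev:            # single convergence check (host/device sync)
            return x
```

With a monotone, idempotent step – e.g. one that grows a set of attracted nodes – the result is identical to checking after every iteration, at a third of the synchronisation cost.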

Importantly, efficient parallelisation of the algorithm on the GPU requires us to select the 'Naive attractor' implementation as the underlying algorithm (Algorithm 2) to be parallelised (leading to Algorithm 4), rather than the 'Improved attractor' implementation in [6]. The concepts of 'Naive' and 'Improved' attractors are presented by Arcucci et al. in [6]. In short, the 'Naive' attractor loops over each node and checks whether it can be included in the attractor set, repeating this until no further nodes can be added. The 'Improved' attractor starts from the original attractor set, performing backpropagation along inbound edges to find other nodes that may be included in the set.

GPU deployment. Our GPU implementation works by parallelising the "attract" operation.<sup>4</sup> The sequential version may be executed as such:

	- (Loop 2) For each node, check if it can be included in the attractor set.

And the runtime operations may look like:

	- Can node 1 be included in the attractor set?
	- ...
	- Can node N be included in the attractor set?

Performance is found through the inner loop being efficiently parallelised on the GPU. Additional specifics include the following GPU deployment features. When asking "Can node X be included ...?", the computation taking place is:


Key to our approach is that these operations are efficiently parallelised by means of matrix multiplication operations on the GPU. This is done as follows:


Note that we convert the previous set-based logic to the new vector form:

K ∩ J ≠ ∅ ⇔ k<sub>i</sub> ≠ 0 and K ⊆ J ⇔ k<sub>i</sub> = t<sub>i</sub>.

Improved attractor implementation by Arcucci et al. [6]. The third parity game solver we evaluate is a custom C++ implementation of ZRA using the 'Improved attractor' algorithm in [6], originally implemented there in Java.

ZRA implementations in Oink [7]. The fourth and fifth implementations we evaluate and compare against are the most highly optimised implementation of ZRA developed in [7], and its unoptimised version – without pre-processing routines. We include the latter since our matrix-based ('Naive') implementation is not optimised with pre-processing routines either. These solvers in Oink are referred to as zlk and uzlk in [7]. We note that the parallel implementation of this algorithm is not included, since [7] shows that it is usually outperformed by zlk, which we do include here.

<sup>4</sup> A very different approach, leading to a very different GPU deployment, is presented in [10].

#### 4.1 Evaluation

The implementations evaluated in this paper were tested on a wide repository of parity games, and against state-of-the-art parity game solvers in the literature. The games used for performance evaluation include the suite by Keiren [12] (of games representing model checking and equivalence checking problems) and an additional set of variably sized random games generated by PGSolver [9].<sup>5</sup>

We evaluate the performance of each solver on each game in terms of solve time. As is common practice when evaluating parity game solvers, the overheads incurred due to startup and game loading are not included; this is done in order to obtain numbers that estimate only the running time of the algorithms, and nothing else. With the same aim, we ensured that at most one solver was running at any time, with CPU utilisation not exceeding one core. Finally, in order to allow for a fair comparison of running times only – rather than conflating such results with the robustness of the algorithms – we measured the time for an instance only when all implementations successfully computed a solution. This allows for a fairer comparison with respect to runtime performance purely, because failing a game usually implies a disproportionately high (and arbitrary) runtime. Such failures include timeouts (at 5 minutes) or being unable to load the game, sometimes due to factors having little to do with the running time of the algorithms. Our experiments were conducted on the Google Cloud Platform (GCP) using a T4 n1-highmem-2.<sup>6</sup>

Profile of the input parity games. Our study includes more than 2000 parity games, with sizes ranging from only a few dozen states to games with millions of states. Nodes' out-degrees and the number of distinct priorities also cover a wide range. However, both random games and structured ones (model checking and equivalence checking) are typically represented by sparse graphs, a feature that we leverage for implementation purposes.

## 5 Analysis of results

As can be seen from Tables 1, 2, and 3, we evaluate the five main implementations, all of them following the ZRA philosophy, using two types of parity games: structured and random. Both types of benchmarks are as in [7] and [6], arguably the two best implementations of ZRA. The focus of this evaluation is to understand the usefulness and scalability of the 'GPU matrix' algorithm, which most cleanly embodies our working hypothesis, namely, that the combination of a matrix-based representation of ZRA and the use of modern GPU technologies can outperform the state of the art in the design of algorithms for parity games – a hypothesis for which we provide strong evidence here.

<sup>5</sup> These random games were generated using parameters identical to those of the random games in the 'PGSolver' collection in the suite of benchmarks by Keiren.

<sup>6</sup> In order to compare performance in different hardware (GPU) architectures, we use a different technology for experiments presented in a forthcoming section.


Table 1: Times are in milliseconds (ms) representing the average time taken to solve games that all implementations passed (i.e., if any implementation fails to solve a game, the game is excluded from the time average of all five solvers, including an additional GPU implementation on an RTX2060S, presented later). Failures occur with a small number of large equivalence checking games only. Failures include a few timeouts (at 5 mins), and usually being unable to load the game in memory due to hardware limitations posed by the GPU architectures. Columns P/F show the number of games passed/failed for every type of game.


Table 2: Results in this table are formatted as in Table 1. In this table, we report the performance (average time in milliseconds taken to solve a single game) for the 5 algorithms on large (>1M nodes) parity games only.


Table 3: Results in this table are formatted as in Table 1. In this table, we report the performance (average time in milliseconds taken to solve a single game) for the 5 algorithms on "small" (<1M nodes) parity games only: results for structured and random games appear in the top table and for random games (detailed) at the bottom. In the bottom table, there are 200 games per column, apart from column 640K which has 100 games; there are no failures.

The results above also show that going from the sequential version of our approach, 'Naive (matrix) attractor', to its parallel GPU implementation yields significant improvements. These two main "internal" results are then compared with the state of the art in the algorithmic design of solutions based on ZRA, namely, the improved attractor in [6] and the highly optimised procedure zlk in Oink [7], which even outperforms its own parallel implementation; cf. [7]. Finally, the unoptimised Oink version of this procedure, uzlk, is also included, simply because our matrix-based procedure does not contain any of the pre-processing routines that differentiate zlk from uzlk. Thus, in a way, uzlk provides a somewhat fairer comparison.

GPU matrix vs Naive (matrix) attractor. Results in all tables show that the parallel implementation using GPU technologies outperforms its own sequential implementation ('Naive matrix attractor') by a wide margin, with some exceptions, usually ranging from 5 times faster in some cases (e.g., model checking of large games) to more than 10 times faster (e.g., model checking of small games). This, we believe, is due to the fact that the bigger the input instances to be analysed, the more the overheads of running the procedure in parallel are compensated later on. A trend in that direction can be observed in detail when comparing the performance of these two algorithms over small random games. But, in any case, our matrix-based approach is always at least as good as its sequential implementation.

GPU matrix vs Improved attractor. The results show that the parallel matrix-based approach can outperform the improved attractor procedure by Arcucci et al. [6] by 2–7 orders of magnitude, depending on the type of game being solved, with the best results obtained when solving random games, whether large or small. However, the sequential version of 'GPU matrix', that is, the Naive implementation, is usually about twice as slow as the improved attractor implementation on structured games. Conversely, even the (sequential) Naive implementation of the matrix-based method outperforms the improved attractor procedure on random games, by about 30% overall. Looking at all the tables of results together, one can see that this is in fact an indicator that the improved attractor approach performs somewhat poorly over random graphs, at least when compared to its performance over structured games.

GPU matrix vs Oink. Even though the GPU matrix-based implementation outperforms Oink's zlk, it usually does so only by a factor of 1.5 to 2.0, with the GPU implementation performing more efficiently over (large) random games than over structured ones. This result actually speaks very highly of the optimised sequential implementation of ZRA. However, as shown in [7], zlk performs even better than its own parallel implementation (called zlk-8 in [7]) when solving model checking parity games (by a very small margin) and when solving random games, where it is nearly twice as fast; cf. Table 3 of [7]. Only when solving equivalence checking parity games does zlk-8 outperform zlk, and only by about a 13% margin. In contrast, the GPU implementation here outperforms zlk by more than a 70% margin, and is even twice as fast when solving small equivalence checking games.

However, as we can see from all tables, the GPU matrix-based implementation has some failures (timeouts, or failures to load the game into memory, mainly due to game size), while the improved attractor method never fails on the considered set of benchmarks. This indicates that, in this particular case, there may be a choice to be made between the potentially marginal gain in efficiency of the GPU approach and the greater reliability offered by zlk. On the other hand, zlk clearly outperforms the sequential (Naive) implementation of the matrix-based approach, ranging from twice as fast when solving random games to about four times as fast when solving structured games. Regarding performance against Oink's uzlk, all the analyses above remain similar, except that a better factor is usually obtained in favour of the GPU matrix-based approach.

Improved attractor vs Oink's zlk. Although these two procedures were originally developed in prior work, we comment on their comparative performance for the sake of completeness of the analysis. As can be seen from our results, both offer the same reliability, as neither fails to solve any instance. Regarding runtime efficiency, we observe that, on average, Oink's zlk implementation tends to be 1.5 to 3.0 times faster than the improved attractor method, with the worst/best comparative performance observed on model checking/random parity game instances, respectively. This makes zlk perhaps the most efficient sequential implementation of ZRA currently available in the literature, outperformed only when a parallel approach is considered.

## 6 Special cases

In this section, we analyse in more detail two special cases of our results: performance when solving large parity games and performance on random games.

#### 6.1 Solving large parity games

For the purposes of this section, a large parity game is a game with more than 1 million nodes. Our results show that for games that are not large (Table 3), all solvers may be regarded as running efficiently from a human perspective, with some random games with more than 500K nodes being solved in about half a second by the slowest implementation on random games (the improved attractor implementation). In most other instances, solutions can be obtained in just a few milliseconds. For instance, the model checking parity games in the suite of benchmarks can be solved in less than 0.1 minutes by any studied solver, and even in less than 10 milliseconds on average using the parallel GPU matrix-based approach, with the Oink implementation taking virtually the same time (just a little more than 10 milliseconds on average). Thus, the real challenge when solving parity games in practice is solving large parity games, where the relative performance of different solvers can be much better exposed (Table 2).

Our results show (Tables 1 and 2) that, despite the raw data differing by about 9 orders of magnitude, nearly the same relative performance is obtained when comparing performance over all games with performance over large games only, which account for no more than 15% of the equivalence checking games, 10% of the model checking games, and less than 5% of the random games. This result indicates that, in order to evaluate the performance of parity game solvers in practice, one should focus on large games only. As the data shows, in that case the parallel GPU matrix-based approach outperforms the second-best technique by a factor of approximately 1.5-2.0, and its own sequential implementation by a factor of 4 to 5, in each case depending on the type of parity game under consideration. The analysis holds across all solvers.

#### 6.2 Solving random parity games

Random parity games are a common benchmark for parity game solvers, being the focus of the study in [6]. Our detailed experiments on random parity games show that the parallel GPU implementation of the matrix-based approach is comparable to the parallel implementation of the improved attractor in [6] (see Table 3 there), in the sense that a similar relative gain in performance is achieved overall, performing about 3.5-4.0 times faster over random games of up to 20K nodes. The gain in performance increases in our case when considering larger random graphs, perhaps indicating that our approach is more scalable in terms of running time; however, in [6], only results on random games of up to 20K nodes are presented. We note that, in this case, merely changing the programming language (Java in [6], C++ here) improves performance: random games with 20K nodes that took more than 5 seconds to solve there are solved in just 7 ms on average here.

## 7 Alternative implementations

In this section, we explore two alternative implementations, one based on a change of programming environment and the other on a change of computer architecture. Our results show that while the former is clearly outperformed by the original C++ implementation, the latter shows that even better performance than already reported can be achieved when using other GPU technologies.

A MATLAB implementation. Given MATLAB's facilities for performing matrix operations, we investigated a MATLAB implementation of our matrix-based approach to understand whether it could perform better than our original C++ implementation. The results were negative. The MATLAB implementation of our approach, although simple, performed significantly worse than the other methods, including our own using C++. A summary of the results, which requires little discussion, can be found in Table 4.

Using a different GPU technology. We conducted experiments using the exact same implementation of the GPU matrix solver (previously run on a GCP) on a different GPU architecture, namely an RTX 2060 Super (Ryzen 5 3600). We found that by simply changing to this alternative hardware specification, the results on all types of games were significantly better, as shown in Table 5.


Table 4: Results in this table are formatted as in Table 1. We report results on all games and, in each case, independently remove the time of unsolved instances.


Table 5: Results in this table are formatted as in Table 1. We report results on all games, which show an improvement of a 1.5x factor for structured games, while performing approximately 25% slower over random parity games.

## 8 Concluding remarks and related work

We have shown that a new method for solving parity games using a matrix-based approach can outperform the state-of-the-art techniques, both sequential and parallel, currently available. As such, our results provide a new point of comparison when evaluating modern solvers for parity games. Previous research [7,6,17,9] has shown that ZRA is potentially the best performing algorithm for solving parity games in practice, and here we provide further evidence that this is indeed the case. We also give evidence that C++ implementations of this task are hardly ever outperformed in practice. Finally, we show that choosing the right computer architecture is key to achieving optimal performance, and in particular that, in the case of modern GPU technologies, such a choice can make a significant difference in practice, leading in our study to the development of what is, as of today, the most efficient parallel solver for parity games.

Acknowledgement. This research was funded by the Monash Laboratory for the Foundations of Computing (MLFC) and the Monash Faculty of Information Technology (FIT). Parts of this research were developed as FIT3144 projects ("Advanced computer science research project") at Monash in 2022. Preliminary results on the matrix-based approach to parity games were also developed by Henri Urpani during his FIT3144 project under Gutierrez's supervision in 2021. Finally, we thank the reviewers for helpful comments that improved this paper.

## References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## A GPU Tree Database for Many-Core Explicit State Space Exploration

Anton Wijs() and Muhammad Osama

Eindhoven University of Technology, Eindhoven, The Netherlands {a.j.wijs,o.m.m.muhammad}@tue.nl

Abstract. Various techniques have been proposed to accelerate explicit-state model checking with GPUs, but none address the compact storage of states, or if they do, it comes at the cost of losing completeness of the checking procedure. We investigate how to implement a tree database to store states as binary trees in GPU memory. We present fine-grained parallel algorithms to find and store trees, experiment with a number of GPU-specific configurations, and propose a novel hashing technique, called Cleary-Cuckoo hashing, which enables the use of Cleary compression on GPUs. We are the first to assess the effectiveness of using a tree database, and Cleary compression, on GPUs. Experiments show processing speeds of up to 131 million states per second.

Keywords: Explicit state space exploration, finite-state machines, GPU.

## 1 Introduction

Major advances in computation increasingly need to be obtained via parallel software, as Moore's Law is ending [30]. In the last decade, GPUs have been successfully applied to accelerate various computations relevant for model checking, such as probability computations for probabilistic model checking [8,25,48], counterexample construction [54], state space decomposition [52], parameter synthesis for stochastic systems [12], and SAT solving [34–38,40,43,56,57]. VoxLogicA-GPU applies model checking to analyse (medical) images [9].

In the earliest work on GPU explicit state space exploration, GPUs performed part of the computation, specifically successor generation [18, 19] and property checking once the state space has been generated [5]. This was promising, but the data copying between main and GPU memory and the computations on the CPU were detrimental for performance. The first tool that performed the entire exploration on a GPU was GPUexplore [33, 50, 51, 53]. It was later extended to support LTL model checking [49]. A similar exploration engine was later proposed in [55]. An approach that applied a GPU to explore the state space of Promela models, i.e., the models for the Spin model checker [21], was presented in [6]. This was later adapted to the swarm checker Grapple [16], which can efficiently explore very large state spaces, but at the cost of losing completeness. Finally, the model checker ParaMoC for pushdown systems was presented in [46, 47].

© The Author(s) 2023

S. Sankaranarayanan and N. Sharygina (Eds.): TACAS 2023, LNCS 13993, pp. 684–703, 2023. https://doi.org/10.1007/978-3-031-30823-9_35

The above techniques demonstrate the potential for GPU acceleration of state space exploration and (explicit-state) model checking, being able to accelerate those procedures tens to hundreds of times, but they all have serious practical limitations. Several limit the size of state vectors to 64 bits [6, 55] or the size of transition encodings to 64 bits [46, 47]. GPUexplore does not efficiently support models with variables [50, 53]. When adding variables, the amount of memory needed rapidly grows, due to the growing input model and inefficient state storage. Grapple requires less memory, but uses bitstate hashing. This rules out the ability to detect that all reachable states have been explored, which is crucial to prove the absence of undesired behaviour. ParaMoC verifies pushdown systems, but does not support concurrency, and abstracts away data.

Contributions. We propose how to perform memory-efficient complete state space exploration on a GPU for concurrent Finite-State Machines (FSMs) with data. To make this possible, we are the first to investigate the storage of binary trees in GPU hash tables, propose new algorithms to find and store trees in a fine-grained parallel fashion, experiment with a number of GPU-specific configurations, and propose a novel hashing technique called Cleary-Cuckoo hashing, which enables the use of Cleary compression [13,15] on GPUs. To achieve this, we have to tackle the following challenges: 1) CPU-based algorithms are recursive, but GPUs are not suitable for recursion, and 2) accessing GPU global memory, in which the hash tables reside, is slow. This work marks an important step to pioneer practical GPU accelerated model checking, as it can be extended to checking functional properties of models with data, and paves the way to investigate the use of Binary Decision Diagrams [29] for symbolic model checking.

The structure of the paper is as follows. In Section 2, we discuss related work on GPU hash tables. Section 3 presents background information on GPU programming, and Section 4 contains an overview of the state space exploration engine. Section 5 addresses the challenges when designing a GPU tree table, and presents our new algorithms. Experimental results are given in Section 6, and in Section 7, conclusions and our future work plans are discussed.

## 2 Related Work

An overview of related work on GPU acceleration of model checking is given in Section 1. In the current section, we focus on hash tables [14] for the GPU. In explicit state space exploration, states are typically stored in a hash table. Such a table is often implemented as an array, where the elements represent the hash table buckets. A recent survey of GPU hash tables [31] identifies that when using integer data items and unordered insertions and queries, Cuckoo hashing [41] is (currently) the best option, compared to techniques such as chaining [3] or Robin Hood hashing [20], and the Cuckoo hashing of [1] is particularly effective. In Cuckoo hashing, collisions, i.e., situations where a data item e is hashed to an already occupied bucket, are resolved by evicting the encountered item e′, storing e, and moving e′ to another bucket. A fixed number of m hash functions is used to have multiple storage options for each item. Item look-up and storage is therefore limited to m memory accesses, but can lead to chains of evictions. In [1], it is demonstrated that with four hash functions, a hash table needs around 1.25N buckets to store N items.<sup>1</sup> Recent research [4] has demonstrated that using larger buckets, spanning multiple elements, that still fit in the GPU cache line is beneficial for performance, and increases the average load factor, i.e., how much the hash table can be filled until an item cannot be inserted, to 99%. We address this in detail in Section 3. However, in [4], an older NVIDIA GPU of the Volta architecture was used (2017), while more recent GPUs are supposedly less susceptible to optimisations exploiting the cache line. In this work, we experimentally assess this for hash table buckets.

Besides buckets, we also consider Cuckoo hashing as used in [1, 4], but we are the first to investigate the storage of binary trees, and the use of Cleary compression to store more data in less space. Libraries offering GPU hash tables, such as [23], do not offer these capabilities. Furthermore, we are the first to investigate the impact of using larger buckets for binary tree storage embedded in a state space exploration engine.

The model checker GPUexplore [11, 50, 53] uses multiple hash functions to store a state. State evictions are never performed, as each state is stored in a sequence of integers, making it impossible to store states atomically. This can lead to duplicate states being stored, an effect that tends to worsen when states are evicted, making Cuckoo hashing impractical [51]. Besides compact state storage, a second benefit of using trees, with each node stored in a single integer, is that arbitrarily large states can be stored atomically, i.e., a state is stored the moment the root of its tree is stored.

Because we store trees, with the individual nodes referencing each other, we do not consider alternative storage approaches, such as using a list that is repeatedly sorted, even though Alcantara et al. identified that using radix sort [32] is competitive with hashing [1].

## 3 GPU programming

CUDA<sup>2</sup> is a programming interface that enables general purpose programming for a GPU. It has been developed and continues to be maintained by NVIDIA since 2007. In this work, we use CUDA with C++. Therefore, we use CUDA terminology when we refer to thread and memory hierarchies.

The left part of Fig. 1 gives an overview of a GPU architecture. For now, ignore the bold-faced words and the pseudo-code. A GPU consists of a finite number of streaming multiprocessors (SMs), each containing hundreds of cores. For instance, a Titan RTX, which we used for this work, has 72 SMs containing 4,608 cores in total. A programmer can implement functions, named kernels, to be executed by a predefined number of GPU threads. Parallelism is achieved by having these threads work on different parts of the data.

<sup>1</sup> This refers to the single-level version of their Cuckoo hashing [1], which we consider in this work. Their two-level version is more complex and less efficient.

<sup>2</sup> https://developer.nvidia.com/cuda-zone.

Fig. 1: State space exploration on a GPU architecture.

When a kernel is launched, threads are grouped into blocks, usually of a size equal to a power of two, often 512 or 1,024. Each block is executed by one SM, but an SM can interleave the execution of many blocks. When a block is executed, the threads inside are scheduled for execution in smaller groups of 32 threads called warps. A warp has a single program counter, i.e., the threads in a warp run in lock-step through the program. This concept is referred to as Single Instruction Multiple Threads (SIMT): each thread executes the same instructions, but on different data. The threads in a warp may also follow diverging program paths, leading to a reduction in performance. For instance, if the threads of a warp encounter an if C then P1 else P2 construct, and for some, but not all, C holds, all threads will step through the instructions of both P1 and P2, but each thread only executes the relevant instructions.

GPU threads can use atomic instructions to manipulate data atomically, such as a compare-and-swap on 32- and 64-bit integers: atomicCAS(addr, compare, val) atomically checks whether at address addr, the value compare is stored. If so, it is updated to val, otherwise no update is done. The actual value read at addr is returned.
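The semantics of atomicCAS can be modelled on the CPU with std::atomic (a host-side sketch only; the helper names atomic_cas and try_claim are ours, not CUDA's):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Host-side model of CUDA's atomicCAS(addr, compare, val): atomically
// stores val at *addr iff the value read equals compare, and returns
// the value that was actually read.
uint32_t atomic_cas(std::atomic<uint32_t>* addr, uint32_t compare,
                    uint32_t val) {
  uint32_t expected = compare;
  // On failure, compare_exchange_strong writes the observed value into
  // `expected`, which matches atomicCAS's return convention.
  addr->compare_exchange_strong(expected, val);
  return expected;
}

// Typical use in a hash table: claim an empty (0) bucket for state s.
// The claim succeeds if the bucket was empty or already holds s.
bool try_claim(std::atomic<uint32_t>& bucket, uint32_t s) {
  uint32_t old = atomic_cas(&bucket, 0u, s);
  return old == 0u || old == s;
}
```

Because the old value is returned, a thread can distinguish "I stored the state", "the state was already there", and "the bucket is taken by another state" with a single memory operation.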

There are various types of memory on a GPU. The global memory is the largest of these, 24 GB in the case of the Titan RTX, and is used to copy data between the host (CPU-side) and the device (GPU-side). It can be accessed by all GPU threads, and has a high bandwidth, but also a high latency. Having many threads executing a kernel helps to hide this latency; the cores can rapidly switch contexts to interleave the execution of multiple threads, and whenever a thread is waiting for the result of a memory access, the core uses that time to execute another thread. Another way to improve memory access times is by ensuring that the accesses of a warp are coalesced: if the threads in a warp try to fetch a consecutive block of memory in size not larger than the cache line (128 bytes for a Titan RTX), then the time needed to access that block is the same as the time needed to access an individual memory address.
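The coalescing rule can be made concrete with a small host-side helper (a hypothetical illustration; it assumes the 128-byte cache line of the Titan RTX and 4-byte accesses):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// A warp's 4-byte accesses can be served by a single memory transaction
// when they all fall within one 128-byte cache line (addresses in bytes).
bool single_transaction(const std::vector<size_t>& addrs) {
  size_t lo = *std::min_element(addrs.begin(), addrs.end());
  size_t hi = *std::max_element(addrs.begin(), addrs.end());
  return lo / 128 == (hi + 3) / 128; // first and last byte in same line
}
```

For example, 32 threads reading consecutive 4-byte words starting at an aligned address satisfy this predicate, while a stride-8 access pattern spanning 256 bytes does not.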

Other types of memory are shared memory and registers. Shared memory is fast on-chip memory with a low latency, that can be used as block-local memory; the threads of a block can share data with each other via this memory. In a Titan RTX, each block can use up to 49,152 bytes of shared memory. Register memory is the fastest, and is used to store thread-local data. It is very small, though, and allocating too much memory for thread-local variables may result in data spilling over into global memory, which can dramatically limit the performance.

Finally, the threads in a warp can communicate very rapidly with each other by means of intra-warp instructions. There are various instructions, such as shuffle to distribute register data among the threads and ballot to distribute the results of evaluating a predicate. Since CUDA 9.0, threads can be partitioned into cooperative groups. If these groups have a size that completely divides the warp size, i.e., it is a power of two smaller than or equal to 32, then the threads in a group can use intra-warp instructions among themselves.

In Section 2, we mentioned the use of buckets in a GPU hash table. When a hash table is divided into buckets, each containing 1 < n ≤ 32 elements that still fit in the cache line, cooperative groups of n threads can be created, and the threads in a group can work together to fetch and update buckets. This results in more coalesced memory accesses and reduces thread divergence. However, it also means that fewer tasks can be performed in parallel, and starting with the Turing architecture (2018), on which the Titan RTX is built, NVIDIA has been working on making computations less reliant on coalesced memory accesses.

## 4 GPU state space exploration

Slco. For this work, we extended the state space exploration engine of GPUexplore 2.0 [53] to support models of finite-state concurrent systems written in the Simple Language of Communicating Objects (Slco), version 2.0 [44]. An Slco model consists of a finite number of FSMs. The FSMs can communicate via globally shared variables, and each FSM can have its own local variables. Variables can be of type Bool, Byte and (32-bit) Integer, and there is support for arrays of these types. We refer with (system) states s, s′, … to entire states of the system, and with FSM states σ, σ′, … to the states of an individual FSM. A system state is essentially a vector, containing all the information that together defines a state of the system, i.e., the current states of the FSMs and the values of the variables.

An FSM transition tr = σ −st→ σ′ indicates that the FSM can change state from σ to σ′ iff the associated statement st is enabled. A statement is either an assignment, an expression or a composite. Each can refer to the variables in the scope of the FSM. An assignment is always enabled, and assigns a value to a variable; an expression is a predicate that acts as a guard: it is enabled iff it evaluates to true. Finally, a composite is a finite sequence of statements st<sub>0</sub>; …; st<sub>n</sub>, with st<sub>0</sub> being either an expression or an assignment, and st<sub>1</sub>, …, st<sub>n</sub> being assignments. A composite is enabled iff its first statement is enabled. A transition tr = σ −st→ σ′ can be fired if it is enabled, which results in the FSM atomically moving from state σ to state σ′, and any assignments of st being executed in the specified order. When tr is fired while the system is in a state s, then after firing, the system is in state s′, which is equal to s, apart from the fact that σ has been replaced by σ′, and the effect of st has been taken into account. We call s′ a successor of s.
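The enabledness rules for the three statement kinds can be sketched as follows (a host-side C++ sketch; the Statement type is our own illustration, not the code generated by the tool):

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Statement kinds and their enabledness: assignments are always
// enabled, expressions are enabled iff their guard holds, and a
// composite is enabled iff its first statement is.
struct Statement {
  enum Kind { Assignment, Expression, Composite } kind;
  std::function<bool()> guard;  // used when kind == Expression
  std::vector<Statement> seq;   // used when kind == Composite

  bool enabled() const {
    switch (kind) {
      case Assignment: return true;
      case Expression: return guard();
      case Composite:  return seq.front().enabled();
    }
    return false;
  }
};
```

A composite whose first statement is a false guard is disabled as a whole, even if all its remaining assignments would be enabled on their own.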

The formal semantics of Slco defines that each transition is executed atomically, i.e., cannot be interrupted by the execution of other transitions. The FSMs execute concurrently, using an interleaving semantics. Finally, the FSMs may have non-deterministic behaviour, i.e., at any point of execution, an FSM may have several enabled transitions.

State space exploration. Given an Slco model with n FSMs, first, CUDA functions f<sub>1</sub>, …, f<sub>n</sub> are generated, using a new code generator, that take as input a state s, and produce as output the successors of s which can be reached by firing a transition enabled in s of the i-th FSM. When the state space is generated, each state s can be analysed in parallel by n threads t<sub>1</sub>, …, t<sub>n</sub>, where each t<sub>i</sub> executes f<sub>i</sub> to obtain some of the successors of s.

Fig. 1 presents how the different components of the state space exploration engine map onto a GPU. We explain how the engine works insofar as is needed; for more details, we refer the reader to [50, 51, 53]. Even though the type of input model has changed, as GPUexplore only supports models without data variables, the core of the engine has remained the same.

In the global memory, a large hash table (we call it G) is maintained to store the states visited so far. At the start, the initial state of the input model is stored in G. Each state in G has a Boolean flag new, indicating whether the state has already been explored, i.e., whether or not its successors have been constructed.

On the right in Fig. 1, the state space exploration algorithm is explained from the perspective of a thread block. While the block can find unexplored states in G, it selects some of those for exploration. In fact, every block has a work tile residing in its shared memory, of a fixed size, which the block tries to fill with unexplored states at the start of each exploration iteration. Such an iteration is initiated on the host side by launching the exploration kernel. States are marked as explored when added by threads to their tile.

Next, every block processes its tile. For this, each thread in the block is assigned to a particular state/FSM combination. Each thread accesses its designated state in the tile, and analyses the possibilities for its designated FSM to change state, as explained before. Hence, the threads in a group can generate successors for a single state in parallel.

The generated successors are stored in a block-local state cache, which is a hash table in the shared memory. This avoids repeated accessing of global memory, and local duplicate detection filters out any duplicate successors generated at the block-level. Once the tile has been processed, the threads in the block together scan the cache once more, and store the new states in G if they are not already present. When states require no more than 32 or 64 bits in total (including the new flag), they can simply be stored atomically in G using compare-and-swap. However, sufficiently large systems have states consisting of more than 64 bits. In this paper, we therefore focus on working with these larger states, and consider storing them as binary trees.

Fig. 2: An example of storing state vectors as binary trees.

## 5 A Compact GPU Tree Database

#### 5.1 CPU Tree Storage

The number of data variables in a model, and their types, can have a drastic effect on the size of the states of that model. For instance, each 32-bit integer variable in a model requires 32 bits in each state. As the amount of global memory on a GPU is limited, we need to consider techniques to store states in a memory-efficient way. One technique that has proven itself for CPU-based model checkers is tree compression [7], in which system states are stored as binary trees. A single hash table can be used to store all tree nodes [27]. Compression is achieved by having the trees share common subtrees. Its success relies on the observation that states and their successors tend to be different in only a few data elements. In [27], it is experimentally assessed that tree compression compresses better than any other compression technique identified by the authors for explicit state space exploration. They observe that the technique works well for a multi-threaded exploration engine. Moreover, they propose an incremental variant that has a considerably improved runtime performance, as it reduces the number of required memory accesses to a number logarithmic in the length of the state vector.

Fig. 2 shows an example of applying tree compression to store four state vectors. The black circles should be ignored for now. Each letter represents a part of the state vector that is k bits in length. We assume that in k bits, also a pointer to a node can be stored, and that each node therefore consists of 2k bits. The vector <A,B,C,D,E> is stored by having a root node with a left leaf sibling <A,B>, and the right sibling being a non-leaf that has both a left leaf sibling <C,D>, and the element E. In total, storing this tree requires 8k bits. To store the vector <A',B,C',D,E>, we cannot reuse any of these nodes, as <A',B> and <C',D> have not been stored yet. This means that all pointers have to be updated as well, and therefore, a new root and a new non-leaf containing E are needed. Again, 8k bits are needed. For <A,B',C,D,E'>, we have to store a new node <A,B'> and a new root, and a new non-leaf storing E', but the latter can point to the already existing node <C,D>. Hence, only 6k bits are needed to store this vector. Finally, for <A',B,C,D,E'>, we only need to store a new root node, as all other nodes already exist, resulting in only needing 2k bits. It has been demonstrated that as more and more state vectors are stored, eventually new vectors tend to require 2k bits each [26, 27].
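The sharing arithmetic of this example can be reproduced with a small hash-consing sketch (host-side C++; the TreeDb type is ours, using string-valued k-bit parts purely for illustration):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hash-consed storage of 5-part vectors <v0,...,v4> as binary trees
// shaped as in Fig. 2: root(leaf<v0,v1>, nonleaf(leaf<v2,v3>, v4)).
// Every node holds two k-bit parts, so a newly stored node costs 2k
// bits; an already present node is shared for free.
struct TreeDb {
  std::map<std::pair<std::string, std::string>, int> table; // node -> index
  int new_nodes = 0;

  // Return the index of node (l, r), storing it if not yet present.
  int store(const std::string& l, const std::string& r) {
    auto key = std::make_pair(l, r);
    auto it = table.find(key);
    if (it != table.end()) return it->second; // shared subtree
    int idx = static_cast<int>(table.size());
    table.emplace(key, idx);
    ++new_nodes;
    return idx;
  }

  // Store one vector; return the number of bits this vector needed.
  int store_vector(const std::vector<std::string>& v, int k) {
    new_nodes = 0;
    int ab = store(v[0], v[1]);                                // leaf
    int cd = store(v[2], v[3]);                                // leaf
    int nl = store("#" + std::to_string(cd), v[4]);            // non-leaf
    store("#" + std::to_string(ab), "#" + std::to_string(nl)); // root
    return new_nodes * 2 * k;
  }
};
```

Storing the four vectors of Fig. 2 in order then costs 8k, 8k, 6k and 2k bits respectively, matching the counts in the text.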

To emphasise that GPU tree compression has to be implemented vastly differently from the typical CPU approach, we first explain the latter, and the incremental approach [27]. Checking for the presence of a tree and storing it if not yet present is typically done by means of recursion (outlined by Alg. 1). For now, ignore the red underlined text. The store function returns the address of the given node in G, if present; otherwise it stores the node and returns its address. The findorput-cpu function first recursively checks whether the siblings of the node are stored, and if not, stores them, after which the node itself is stored. A node has pointers left and right to addresses of G, and there are functions to check for the existence of, and retrieve, the siblings of a node.

In the incremental approach, when creating a successor s′ of a state s, the tree for s, say T(s), is used as the basis for the tree T(s′). When T(s′) is created, each node inside it is first initialised to the corresponding node in T(s), and the leaves are updated for the new tree. This 'updated' status propagates up: when a non-leaf has an updated sibling, its corresponding G pointer must be updated when T(s′) is stored in G, but for any non-updated sibling, the non-leaf can keep its G pointer. When incorporating the red underlined text in Alg. 1, the incremental version of the function is obtained. With this version, tree storage often results in fewer calls to store, i.e., fewer memory accesses.
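The recursive find-or-put with the incremental check can be sketched as follows (host-side C++; Node, GlobalTable and the address scheme are simplified illustrations in the spirit of Alg. 1, not the actual implementation):

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>

// In-memory tree node. A null child pointer means the corresponding
// k-bit part holds raw data; otherwise the part is filled in with the
// child's address in G. `updated` is the incremental flag of Alg. 1.
struct Node {
  Node* left = nullptr;
  Node* right = nullptr;
  uint32_t lval = 0, rval = 0;
  bool updated = true;  // new or changed node: must be (re)stored
  uint32_t gaddr = 0;   // address in G once stored
};

// G maps node contents to addresses and counts store calls (i.e.,
// global memory accesses in the real setting).
struct GlobalTable {
  std::map<std::pair<uint32_t, uint32_t>, uint32_t> g;
  int accesses = 0;

  uint32_t store(uint32_t l, uint32_t r) {
    ++accesses;
    auto it = g.find({l, r});
    if (it != g.end()) return it->second;
    uint32_t addr = static_cast<uint32_t>(g.size()) + 1;
    g.emplace(std::make_pair(l, r), addr);
    return addr;
  }
};

// Recursive find-or-put. The incremental check skips any subtree that
// is not marked updated, reusing its known G address.
uint32_t findorput(Node* n, GlobalTable& G) {
  if (!n->updated) return n->gaddr;
  uint32_t l = n->left ? findorput(n->left, G) : n->lval;
  uint32_t r = n->right ? findorput(n->right, G) : n->rval;
  n->gaddr = G.store(l, r);
  n->updated = false;
  return n->gaddr;
}
```

Storing the tree of <A,B,C,D,E> touches G four times; storing a successor that reuses the <C,D> leaf then touches G only three more times, as the non-updated leaf is skipped.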

There are two main challenges when considering GPU incremental tree storage: 1) Recursion is detrimental to performance, as call stacks are stored in global memory (and with thousands of threads, a lot of memory would be needed for call stacks), and 2) The nodes of a tree tend to be spread all over the hash table, potentially leading to many random accesses. To address these, we propose a procedure in which threads in a block store sets of trees together in parallel.

#### 5.2 GPU Tree Generation

When states are represented by trees, the tile of each thread block cannot store entire states, but it can store the roots of trees. To speed up successor generation, and avoid repeated uncoalesced global memory accessing, the trees of those roots are retrieved and stored in the shared memory (state cache) by the thread block. Once this has been done, successor generation can commence.

Fig. 3 shows an example of the state cache evolving over time as a thread generates the successor s′ = <A,B',C,D,E'> of s = <A,B,C,D,E>, with the trees as in Fig. 2. Each square represents a k-bit cache entry. In addition to two entries needed to store a node, we also use one (grey) entry to store two cache pointers or indices, and assume that k bits suffice to store two pointers (in practice, we use k = 32, which is enough, given the small size of the state cache). Hence, every pair of white squares followed by a grey square constitutes one cache slot. Initially (shown at the top of the figure), the tile has a cache pointer to the root of s, of

Fig. 3: Successor generation: deriving <A,B',C,D,E'> from <A,B,C,D,E>.

which we know that it contains the G addresses a0 and a1 to refer to its siblings. In turn, this root points, via its cache pointers, to the locally stored copies of its siblings. The non-leaf one contains the global address a2. A leaf has no cache pointers, denoted by '-'. When creating s′, first, the designated thread constructs the leaf <A,B'>, by executing the appropriate generated CUDA function (see Section 4), and stores it in the cache. In Fig. 3, it is coloured black, to indicate that it is marked as new. Next, the thread creates a copy of <a2,E>, together with its cache pointers, and updates it to <a2,E'>. Finally, it creates a new root, with cache pointers pointing to the newly inserted nodes. This root still has global address gaps to be filled in (the '?' marks), since it is still unknown where the new nodes will be stored in G.

The reason that we store global addresses in the cache is not to access the nodes they point to, but to achieve incremental tree storage: in the example, as the global address a2 is stored in the cache, there is no need to find <C,D> in G when the new tree is stored; instead, we can directly construct <a2,E'>. This contributes to limiting the number of required global memory accesses.
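The slot layout described above can be sketched as follows. The packing of two cache indices into one k = 32 bit entry matches the description, while the names and sentinel values (NO_PTR for '-', GAP for '?') are our own illustrative choices.

```cpp
#include <cassert>
#include <cstdint>

// One cache slot: two 32-bit node entries plus one 32-bit entry that
// packs two cache pointers (indices into the cache), since k = 32
// suffices for two indices given the small size of the state cache.
constexpr uint32_t NO_PTR = 0xFFFF;      // '-': a leaf has no cache pointers
constexpr uint32_t GAP    = 0xFFFFFFFF;  // '?': global address not yet known

inline uint32_t pack_cptrs(uint32_t left, uint32_t right) {
    return (left << 16) | (right & 0xFFFF);
}
inline uint32_t left_caddr(uint32_t packed)  { return packed >> 16; }
inline uint32_t right_caddr(uint32_t packed) { return packed & 0xFFFF; }

struct Slot {
    uint32_t gaddr[2];  // global addresses of the two siblings, or GAP
    uint32_t cptrs;     // packed cache pointers, or NO_PTR markers
};
```

For instance, the freshly created root of s′ would carry {GAP, GAP} as global addresses, with its cache pointers referring to the two new nodes.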

Note that there is no recursion. Given a model, the code generator determines the structure of all state trees, and based on this, code to fetch all the nodes of a tree and to construct new trees is generated. As we do not consider the dynamic creation and destruction of FSMs, all states have the same tree structure.

#### 5.3 GPU Tree Storage at Block Level

Once a block has finished generating the successors of the states referred to by its tile, the state cache content must be synchronised with G. Alg. 2 presents how this is done. The findorput-many function is executed by all threads in the block simultaneously. It consists of an outer while-loop (l.5-28), that is executed as long as there is work to be done. The code uses a cooperative group called bg, which is created to coincide with the size of a bucket (bucketsize). When no buckets are used, these groups can be interpreted as consisting of only a single thread each. At l.4, the offset of each thread is determined, i.e., its ID inside its group, ranging from 0 to the size of the group minus one.

Every thread that still has work to do (l.5) enters the for-loop of l.7-27, in which the content of the state cache is scanned. The parallel scanning works as follows: every thread first considers the node at position tid − offset of the cache, with tid being the thread's block-local ID. This node is assigned to the thread with bg ID 0. If that index is still within the cache limits, all threads of

```
Algorithm 2: Tree-based Find-or-put-many, at thread block level.
```

```
1 device function findorput-many(node t* G):
2 node t p, q; index t addr; bool work to do ← true; bool ready; byte ballot result
3 auto bg ← tiled-partition⟨bucketsize⟩(this-thread-block())
4 byte offset ← bg.thread-rank()
5 while work to do do
6 work to do ← false
7 for i ← tid − offset; i < CACHE SIZE; i ← i + BLOCK SIZE do
8 ready ← false
9 if i + offset < CACHE SIZE then
10 p ← cache[i + offset]
11 if is-new-leaf(p) then ready ← true
12 else if is-new-nonleaf(p) then
13 if left-gap(p) then
14 cache[i + offset] ← set-left-gaddr(p, cache[left-caddr(p)])
15 if right-gap(p) then
16 cache[i + offset] ← set-right-gaddr(p, cache[right-caddr(p)])
17 if ¬(left-or-right-gap(p)) then ready ← true
18 else work to do ← true
19 ballot result ← bg.ballot(ready)
20 while ballot result do
21 lane ← find-first-set(ballot result) - 1; q ← bg.shuffle(p, lane)
22 addr ← findorput-single(bg, G, q)
23 if offset = lane then
24 ready ← false
25 if addr = FULL then signal hash table full
26 else set-gaddr(cache[i], addr)
27 ballot result ← bg.ballot(ready)
28 work to do ← bg.ballot(work to do)
```
bg have to move along, regardless of whether they have a node to check or not. At the next iteration of the for-loop, the thread jumps over BLOCK SIZE nodes as long as the index is within the cache limits.

The main goal of this loop is to check which nodes are ready for synchronisation with G. Initially, this is the case for all nodes without global address gaps (see Subsection 5.2). Each thread first checks whether its own index is still within the cache limits (l.9). If so, the node p is retrieved from the cache at l.10. If it is a new leaf, ready is set to true, to indicate that the active thread is ready for storage (l.11). If the node is a new non-leaf (l.12), it is checked whether the node still has global address gaps. If it has a gap for the left sibling (l.13), this left sibling is inspected via the cache pointer to this sibling (retrieved with the function left-caddr (l.14)). The function set-left-gaddr checks whether the cache pointers of that sibling have been replaced by a global memory address, and if so, uses that address to fill the gap. The same is done for the right sibling at l.15-16. If, after these operations, the node p contains no gaps (l.17), ready is set to true. If the node still contains a gap, another loop iteration is required, hence work to do is set to true (l.18).
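The gap-filling scan can be illustrated sequentially: a node becomes ready for storage once its siblings' global addresses are known, and the outer loop repeats until no gaps remain. The following C++ sketch is a simplified, single-threaded rendering of this fixpoint, not the block-parallel code of Alg. 2; the names are ours.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sequential sketch of the gap-filling scan: a non-leaf is ready for
// storage only once both siblings have a global address.
struct CNode {
    bool leaf;
    int left = -1, right = -1;  // cache pointers (indices), or -1
    long gaddr = -1;            // -1 encodes a '?' gap
};

// One store pass over the cache; returns true if work remains.
bool store_pass(std::vector<CNode>& cache, long& next_addr) {
    bool work_to_do = false;
    for (auto& n : cache) {
        if (n.gaddr != -1) continue;  // already stored
        bool ready = n.leaf ||
            (cache[n.left].gaddr != -1 && cache[n.right].gaddr != -1);
        if (ready) n.gaddr = next_addr++;  // findorput-single stand-in
        else work_to_do = true;            // gap remains: another pass
    }
    return work_to_do;
}
```

For a 3-level tree laid out root-first, the leaves are stored in the first pass, the inner non-leaf in the second, and the root in the third, after which no work remains.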

At l.19, the threads in the group perform a ballot, resulting in a bit sequence indicating for which threads ready is true. As long as this is the case for at least one thread, the while-loop at l.20-27 is executed. The function find-first-set identifies the least significant bit set to 1 in ballot result (l.21), and the shuffle instruction results in all threads in bg retrieving the node of the corresponding bg thread. This node is subsequently stored by bg, by calling findorput-single (l.22) (explained later). Finally, the thread owning the node (l.23) resets its ready flag (l.24), and if the hash table is considered full, reports this globally (l.25). Otherwise, it records the global address of the stored node (l.26). After that, ballot result is updated (l.27). Finally, once the for-loop is exited, the bg threads determine whether they still have more work to do (l.28).
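The ballot-and-serve pattern of l.19-27 can be mimicked on the host with a plain bitmask. In this illustrative C++ sketch, find-first-set corresponds to CUDA's __ffs intrinsic, and drain is our own name for the loop that serves each ready lane in turn.

```cpp
#include <cassert>
#include <cstdint>

// 1-based index of the lowest set bit, 0 if none (like CUDA's __ffs).
inline int find_first_set(uint32_t x) {
    if (x == 0) return 0;
    int i = 1;
    while (!(x & 1)) { x >>= 1; ++i; }
    return i;
}

// Drain a ballot mask as on l.20-27 of Alg. 2: repeatedly pick the
// lowest ready lane, let the group serve its node, then let the owning
// lane reset its ready flag. Returns how many lanes were served.
int drain(uint32_t ballot, int* order) {
    int n = 0;
    while (ballot) {
        int lane = find_first_set(ballot) - 1;  // as on l.21
        order[n++] = lane;
        ballot &= ballot - 1;  // the owning lane resets its ready flag
    }
    return n;
}
```

With ballot 0b101100, lanes 2, 3 and 5 are served, in that order.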

#### 5.4 Single Node Storage at Bucket Group Level

In this section, we address how individual nodes are stored by a cooperative group bg. Before we explain the algorithm for this, Alg. 3, in detail, we consider our options for hashing, and propose a novel combination of existing techniques.

In Section 2, we argued that Cuckoo hashing is very effective on a GPU. However, as it frequently moves elements, it is not suitable for a single hash table, since the non-leaves of a tree refer to the positions of other nodes. We address this by maintaining two hash tables, one for tree roots, and one for the other nodes, as done in [26]. The roots are then not referred to, and hence Cuckoo hashing can be applied on the root table.

In fact, when using two hash tables, we can be even more memory-efficient. In [26], it was shown that Cleary tables [13, 15] can be very effective to store state spaces. To handle collisions in Cleary tables, order-preserving bidirectional linear probing [2] is used, which involves moving nodes to preserve their order. This makes Cleary tables, like Cuckoo hashing, not suitable to store entire trees, but they can be used to store the roots of the trees. In a Cleary table for roots of size 2k, each root r is hashed (bit scrambled) with a hash function h to a 2k-bit sequence, from which w < 2k bits are taken to be used as the address to store r in a table with exactly 2^w buckets, and at this position, the remaining 2k − w bits (the remainder) are actually stored. To enable decompression, h must be invertible; given a remainder and an address, h^−1 can be applied to obtain r.
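A minimal sketch of Cleary compression, assuming 64-bit roots for simplicity (the paper uses 2k-bit roots): multiplication by an odd constant modulo 2^64 is a bijection, so it can serve as the required invertible scramble h. The constant and the number of address bits W here are illustrative, not the paper's.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t MUL = 0x9E3779B97F4A7C15ULL;  // odd => invertible mod 2^64
constexpr int W = 20;                            // address bits (illustrative)

// Modular inverse of an odd 64-bit constant via Newton's iteration:
// x = a is correct to 3 bits for odd a, and each step doubles precision.
constexpr uint64_t inv_odd(uint64_t a) {
    uint64_t x = a;
    for (int i = 0; i < 5; ++i) x *= 2 - a * x;
    return x;
}
constexpr uint64_t IMUL = inv_odd(MUL);

uint64_t h(uint64_t r)     { return r * MUL; }
uint64_t h_inv(uint64_t v) { return v * IMUL; }

// Compress root r: the top W bits of h(r) address the bucket; only the
// remaining 64 - W bits (the remainder) are actually stored there.
void compress(uint64_t r, uint64_t& addr, uint64_t& rem) {
    uint64_t v = h(r);
    addr = v >> (64 - W);
    rem  = v & ((uint64_t(1) << (64 - W)) - 1);
}

// Decompress: reassemble the scrambled value and invert h.
uint64_t decompress(uint64_t addr, uint64_t rem) {
    return h_inv((addr << (64 - W)) | rem);
}
```

The bucket address thus carries W bits of the root "for free": those bits never occupy table memory, which is where the compression comes from.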

In a multi-threaded CPU context, this approach scales well [26], but the parallel approach of [26, 45] divides a Cleary table into regions, and sometimes, a region must be locked by a thread to safely reorder nodes. Unfortunately, the use of any form of locking, also fine-grained locking implemented with atomic operations, is detrimental for GPU performance. Further, the absence of coherent caches in GPUs means that expensive global memory accesses may be needed when a thread repeatedly checks the status of an acquired lock.

As an elegant alternative, we propose Cleary-Cuckoo hashing, which combines Cleary compression with Cuckoo hashing. We use m hash functions that are invertible (as with Cuckoo hashing) and capable of scrambling the bits of a root to a 2k-bit sequence (as in Cleary tables). When we apply a function h_i (0 ≤ i < m) on a root r, we get a 2k-bit sequence, of which we use w bits for an address d, and store at d the remainder r′ consisting of 2k − w + ⌈log2(m)⌉ + 1 bits. The ⌈log2(m)⌉ bits are needed to store the ID of the used hash function (i), and the final bit is needed to indicate that the root is new (unexplored). It is possible to retrieve r by applying h_i^−1 on d and r′ without the hash function ID and the new bit. When a collision occurs, the encountered root is evicted,

```
Algorithm 3: Single node find-or-put, at bucket group level.
```

```
1 device function index t findorput-single(tile t bg, node t* G, node t p):
2 node t q; index t addr
3 (q, addr) ← fop-cuckoo-root(bg, G, p)
4 for i ← 0; q ≠ p and i < MAX EVICT; i ← i + 1 do
5 (q, addr) ← fop-cuckoo-root(bg, G, q)
6 return (i = MAX EVICT ? FULL : addr)
7 device function (node t, index t) fop-cuckoo-root(tile t bg, node t* G, node t p):
8 comprnode t cp, cq; node t q
9 hs ← get-hash-start(p); byte offset ← bg.thread-rank()
10 for i ← 0; i < NUM HASH FUNCTIONS; i ← i + 1 do
11 (addr, cp) ← addr-compr-root(p, h(hs+i) mod NUM HASH FUNCTIONS)
12 (cq, pos) ← ht-find(bg, offset, G, addr, cp)
13 if cq = cp then return (p, addr + pos)
14 if cq = EMPTY then
15 hs ← h(hs+i) mod NUM HASH FUNCTIONS
16 break
17 if i = NUM HASH FUNCTIONS then (cp, addr) ← addr-compr-root(p, hs)
18 (cq, pos) = ht-insert-cuckoo(bg, offset, G, addr, cp)
19 if cq ≠ EMPTY and cq ≠ cp then
20 q ← get-decompr-root(cq, addr)
21 return (q, addr + pos)
22 return (p, addr + pos)
```
decompressed, and stored again using the hash function next in line for that root. We refer to the application of Cleary compression to roots as root compression.
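The remainder layout can be checked with a few lines of bit arithmetic. This C++ sketch uses the concrete parameters reported in Section 6 (58-bit roots, w = 32 address bits, m = 32 hash functions); the field order within the packed word is our own illustrative choice.

```cpp
#include <cassert>
#include <cstdint>

// Bit budget of a Cleary-Cuckoo remainder:
// 2k - w + ceil(log2(m)) + 1 = 58 - 32 + 5 + 1 = 32 bits per stored root.
constexpr int ROOT_BITS = 58, W_BITS = 32, M = 32;
constexpr int ID_BITS = 5;  // ceil(log2(32))
constexpr int REM_BITS = ROOT_BITS - W_BITS + ID_BITS + 1;
static_assert(REM_BITS == 32, "the remainder fits a 32-bit table element");

// Pack the 26 remainder bits, the hash-function ID, and the 'new' bit.
uint32_t pack(uint32_t rem, uint32_t hash_id, bool is_new) {
    return (rem << (ID_BITS + 1)) | (hash_id << 1) | (is_new ? 1u : 0u);
}
uint32_t rem_of(uint32_t p)     { return p >> (ID_BITS + 1); }
uint32_t hash_id_of(uint32_t p) { return (p >> 1) & (M - 1); }
bool     is_new_of(uint32_t p)  { return p & 1; }
```

Decompression strips the hash ID and new bit first, then applies h_i^−1 to the address and the remaining 26 bits.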

Alg. 3 presents one version of the findorput-single function, to which a call in Alg. 2 is redirected when a root is provided. Here, G is a Cleary-Cuckoo table that is only used to store roots. In findorput-single, a second function fop-cuckoo-root (l.7-22) is called repeatedly, as long as nodes are evicted or until the pre-configured MAX EVICT has been reached, which prevents infinite eviction sequences (l.4). The function fop-cuckoo-root returns the address where the given node was found or stored, and a node, which is either the node that had to be inserted or the one that was already present.

In the fop-cuckoo-root function, lines highlighted in purple are specific for root compression, i.e., Cleary compression of roots, while the green highlighted lines concern Cuckoo hashing, addressing node eviction. The ID of the first hash function to be used for node p, encoded in p itself, is stored in hs (l.9), and each thread determines its bg offset. Next, the thread iterates over the hash functions, starting with function hs (l.10-16). The G address and node remainder are computed at l.11. If the node is new, the remainder is marked as new. If root compression is not used, we have p = cp. Then, the function ht-find is called to check for the presence of the remainder in the bucket starting at addr (l.12). If ht-find returns the remainder, then it was already present (l.13), and this can be returned. Note that the returned address is (addr + pos), i.e., the offset at which the remainder can be found inside the bucket is added to addr. Alternatively, if EMPTY is returned, the node is not present and the bucket is not yet full. In this case, a bucket has been found where the node can be stored. The used hash function is stored in hs (l.15) and the for-loop is exited (l.16).

At l.17, if a suitable bucket for insertion has not been found, the initial hs is selected again. At l.18, the function ht-insert-cuckoo is called to insert cp.

#### Algorithm 4: Single node insertion, at bucket group level.


This function is presented in Alg. 4. Finally, if a value other than the original remainder cp or EMPTY is returned, another (remainder of a) node has been evicted, which is decompressed and returned at l.20-21. Otherwise, p is returned with its address (l.22). When Cuckoo hashing is not used, evictions do not occur, and at l.20-21, it is returned that the bucket is full.

Finally, we present ht-insert-cuckoo in Alg. 4. The function ht-find is not presented, but it is almost equal to l.2-3 of Alg. 4. At l.2, each thread in bg reads its part of the bucket G[addr + offset], and checks if it contains cp, the remainder of p. If it is found anywhere in the bucket, the remainder with its position is returned (l.3). In the while-loop at l.4-9, it is attempted to insert cp in an empty position. In every iteration, an empty position is selected (l.5) and the corresponding thread tries to atomically insert cp (l.6). At l.7, the outcome is shared among the threads. If it is either EMPTY or the remainder itself, it can be returned (l.8). Otherwise, the bucket is read again (l.9). If insertion does not succeed, l.10 is reached, where a hash function is used by get-eviction-pos to hash cp to a bucket position. The corresponding thread exchanges cp with the node stored at that position (l.11). After the evicted node has been shared with the other threads (l.12), it is returned together with its position (l.13).
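The claim-an-empty-slot step (l.5-6 of Alg. 4) boils down to a compare-and-swap. The following single-threaded C++ sketch mimics it with std::atomic in place of CUDA's atomicCAS; EMPTY = 0 is an assumed reserved value (so remainders must be nonzero), and the function name try_insert is ours.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr uint32_t EMPTY = 0;  // reserved: remainders must be nonzero
constexpr int BUCKET = 8;

// Try to find or insert remainder cp in a bucket. Returns the slot
// position, or -1 when the bucket is full, at which point the caller
// would evict an occupant (l.10-13 of Alg. 4).
int try_insert(std::atomic<uint32_t>* bucket, uint32_t cp) {
    for (int pos = 0; pos < BUCKET; ++pos) {
        uint32_t seen = bucket[pos].load();
        if (seen == cp) return pos;  // already present
        if (seen == EMPTY) {
            uint32_t expected = EMPTY;
            // atomicCAS on the GPU; compare_exchange on the host
            if (bucket[pos].compare_exchange_strong(expected, cp))
                return pos;
            if (expected == cp) return pos;  // another thread inserted it
        }
    }
    return -1;  // bucket full: eviction would be triggered
}
```

On the GPU, each group member inspects one part of the bucket in parallel and the outcome is shared via a shuffle; the sequential loop above serialises that cooperation.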

## 6 Experiments

We implemented a code generator in Python, using textX [17] and Jinja2,<sup>3</sup> that accepts an Slco model and produces CUDA C++ code to explore its state space. The code is compiled with CUDA 11.4 targeting compute capability 7.5. Experiments were conducted on a machine running Linux Mint 20 with a 4-core Intel Core i7-7700 3.6 GHz, 32GB RAM, and a Titan RTX GPU.

The goal of the experiments is to assess how fast GPU next state computation with the tree database is w.r.t. 1) the various options we have for hashing, 2) state-of-the-art CPU tools, and 3) other GPU tools. For 2), we compare with multi-core Depth-First Search (DFS) of Spin 6.5.1 [22] and (explicit-state) multi-core Breadth-First Search (BFS) of LTSmin 3.0.2 [24, 28].

<sup>3</sup> https://palletsprojects.com/p/jinja/.

Fig. 4: Speed obtained by different GPU configurations.

In our implementation, we use 32 invertible hash functions. Root compression (cmp) can be turned on or off. When selected, we have a root table with 2^32 elements, 32 bits each, and a non-root table with 2^29 elements, 64 bits each. This enables storing 58-bit roots (two pointers to the non-root table) in 58 − 32 + log2(32) + 1 = 32 bits. When using buckets with more than one element (cmp+bu), we have root buckets of size 8, and non-root buckets of size 16. The non-root buckets make full use of the cache line, but the root buckets do not. Making the latter larger means that too many bits for root addressing are lost for root compression to work (the remainders will be too large).

Root compression allows turning Cuckoo hashing on (cmp(+bu)+cu) or off (cmp(+bu)). When it is off, essentially Cleary-Cuckoo is still performed, except that evictions are not allowed, meaning that hashing fails as soon as all possible 32 buckets for a node are occupied.

In the configuration bu, neither root compression nor Cuckoo hashing is applied. We use one table with 2^30 64-bit elements and buckets of size 16. For reasons related to storing global addresses in the state cache, we cannot make the table larger. The 32 hash functions are used without allowing evictions.

Finally, multiple iterations can be run per kernel launch. Shared memory is wiped when a kernel execution terminates, but the state cache content can be reused from one iteration to the next when a kernel executes multiple iterations, by which trees already in the cache do not need to be fetched again from the tree database. We identified 30 iterations to be effective in general (i30), and experimented with a single iteration per kernel launch (i1).

With the CPU tools, we performed reachability analysis on 1- and 4-core configurations, denoted by Sp-1 and Sp-4 for Spin, and Lm-1 and Lm-4 for LTSmin. We only enabled state compression and basic reachability (without property checking), to favour fast exploration of large state spaces.

For benchmarks, we used models from the Beem benchmarks [42] of concurrent systems, translated to Slco and Promela (for Spin). We scaled some of them up to have larger state spaces. Those are marked in Table 1 with '+'. Timeout is set to 3600 seconds for all benchmarks.

Table 1: Millions of states per second for various reachability tools and configurations. Pink cells: out of memory. Yellow cells: timeout. Green cell: best average. o.m.: out of memory at initialisation. SU: speedup of (cmp + i30) vs. (Lm-1).


Fig. 4 compares the speeds of the different GPU configurations in millions of states per second, averaged over 5 runs. For each configuration, we sorted the data to observe the overall trend. The higher the speed the better. The cmp + i30 mode (without Cuckoo hashing or larger buckets) is the fastest for the majority of models. On the other hand, it fails to complete exploration for at.8, the largest state space with 3.7 billion states, due to running out of memory. If Cuckoo hashing is enabled with root compression, all state spaces are successfully explored, which confirms that higher load factors can be achieved [4]. However, Cuckoo hashing negatively impacts performance, which contradicts [4]. Although it is difficult to pinpoint the cause for this, it is clear that it results from our hashing being done in addition to the exploration tasks, while in papers on GPU hash tables [1, 4], hashing is analysed in isolation. With the extra variables and operations needed for exploration, hashing should be lightweight, and Cuckoo hashing introduces handling evictions. The more complex code is compiled to a less performant program, even when evictions do not occur.

Table 1 compares GPU performance with Spin and LTSmin. We refer to our tool as GPUexplore + Slco. From the results of Fig. 4, we selected a set of configurations demonstrating the impact of the various options. For each model, Bits and CR give the state vector length in bits and the compression ratio, defined as (number of roots × number of leaves per tree) / (number of nodes). With the compression ratio, we measure how effective the node sharing is, compared to if we had stored each state individually without sharing. In addition, the speed in millions of states per second is given. Regarding out of memory, we are aware that Spin has other, slower, compression options, but we only considered the fastest, to favour the CPU speeds. Times are restricted to exploration; code generation and compilation always take a few seconds. The best GPU results are highlighted in bold. To compute the speedup (SU), the result of cmp + i30, the overall best configuration, has been divided by the Lm-1 result (the single-core configuration that completely explored all state spaces except one). All GPU experiments have been done with 512 threads per block, and 3,240 blocks (45 blocks per SM). We identified this configuration as being effective for anderson.6, and used it for all models.

Table 2: Millions of states per second for various GPU tools.

While LTSmin tends to achieve near-linear speed-ups (compare Lm-1 and Lm-4), the speed of GPUexplore + Slco heavily depends on the model. For some models, as the state spaces of instances become larger, the speed increases, and for others, it decreases. The exact cause for this is hard to identify, and we plan to work on further optimisations. For instance, the branching factor, i.e., average number of successors of a state, plays a role here, as large branching factors favour parallel computation (many threads will become active quickly).

Our overall fastest configuration does not use larger buckets, nor Cuckoo hashing. Regarding buckets, as already noted in Section 3, starting with the Turing architecture, NVIDIA GPUs are less sensitive to uncoalesced accesses, and our results confirm that. Performing fewer tasks in parallel seems to be more harmful for performance than a larger number of uncoalesced accesses.

Finally, Table 2 compares GPUexplore + Slco with GPUexplore 2.0 and Grapple. A comparison with ParaMoc was not possible, as it targets very different types of (sequential) models. The models we selected are those available for at least two of the tools we considered. Unfortunately, Grapple does not (yet) support reading Promela models. Instead, a number of models are encoded directly into its source code, and we were limited to checking only those models. It can be observed that in the majority of cases, our tool achieves the highest speeds, which is surprising, as the trees we use tend to lead to more global memory accesses, but it is also encouraging to further pursue this direction.

## 7 Conclusions and Future Work

We discussed new algorithms to achieve a GPU tree database, which enables memory-efficient explicit state space exploration for FSMs with data. We proposed Cleary-Cuckoo hashing, which makes it possible to use, for the first time, Cleary compression on GPUs. Experiments show processing speeds of up to 131 million trees per second. In the last decade, new GPUs have been increasingly effective for state space exploration [10], and in the future, they are expected to be more capable of handling thread divergence, which still heavily occurs when accessing G. Therefore, we are optimistic about further improvements. In the future, we will focus on optimisations and verifying temporal logic formulae.

Data Availability Statement. The datasets generated and analysed during the current study are available in the Zenodo repository [39].

## References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Author Index**

#### **A**

Abdulla, Parosh Aziz I-588 Abdulla, Parosh I-105 Aggarwal, Saksham I-666 Agrawal, Sakshi II-588 Albert, Elvira I-448 Aljaafari, Fatimah II-541 Amir, Guy I-607 Anand, Ashwani II-211 Andreotti, Bruno I-367 Apinis, Kalmer II-453 Atig, Mohamad Faouzi I-588 Atig, Mohamed Faouzi I-105 Avigad, Jeremy II-74 Ayaziová, Paulína II-523

#### **B**

Bach, Jakob I-407 Bajwa, Ali I-308 Balachander, Mrudula II-309 Banerjee, Anindya II-133 Barbosa, Haniel I-367 Barrau, Florian II-3 Barth, Max II-577 Bassan, Shahaf I-187 Batz, Kevin II-410 Bentkamp, Alexander II-74 Beutner, Raven I-145 Beyer, Dirk II-152, II-495 Biere, Armin I-426 Blanchette, Jasmin II-111 Bonakdarpour, Borzoo I-29, I-66 Bouma, Jelle II-19 Bruyère, Véronique I-271

#### **C**

Cadilhac, Michaël II-192 Chadha, Rohit I-308

Chakraborty, Supratik II-588 Chalupa, Marek II-535 Chatterjee, Krishnendu I-3 Chen, Mingshuai II-410 Chien, Po-Chun II-152 Chimdyalwar, Bharti II-588 Chin, Wei-Ngan I-569 Cimatti, Alessandro II-3 Cooper, Martin C. I-167 Cordeiro, Lucas C. II-541 Corfini, Sara II-3 Correas, Jesús I-448 Corsi, Davide I-607 Cortes, João II-55 Cristoforetti, Luca II-3

#### **D**

Darke, Priyanka II-588 de Gouw, Stijn II-19 de la Banda, Alejandro Stuckey I-666 de Pol, Jaco van II-353 Deligiannis, Pantazis II-433 Denis, Xavier II-93 Di Natale, Marco II-3 Dietsch, Daniel II-577, II-582 Dimitrova, Rayna II-251 Doveri, Kyveli I-290 Duan, Zhenhua II-571

#### **E**

Erhard, Julian II-547 Ernst, Gidon II-559 Etman, L. F. P. II-44 Eugster, Patrick I-126

#### **F**

Fang, Wenji II-11 Farinelli, Alessandro I-607

© The Editor(s) (if applicable) and The Author(s) 2023 S. Sankaranarayanan and N. Sharygina (Eds.): TACAS 2023, LNCS 13993, pp. 705–708, 2023. https://doi.org/10.1007/978-3-031-30823-9


Fedyukovich, Grigory II-270 Fichtner, Leonard II-577 Filiot, Emmanuel II-309 Finkbeiner, Bernd I-29, I-145 Fokkink, W. J. II-44 Fuchs, Tobias I-407 Furbach, Florian I-588

#### **G**

Ganty, Pierre I-290 Godbole, Adwait A. I-588 Goorden, M. A. II-44 Gordillo, Pablo I-448 Griggio, Alberto II-3 Guo, Xingwu I-208 Gupta, Ashutosh I-105 Gutierrez, Julian I-666

#### **H**

Hadži-Đokić, Luka I-290 Hahn, Ernst Moritz I-527 Hamza, Ameer II-270 Harel, David I-607 Hartmanns, Arnd I-469 Havlena, Vojtěch I-249 Heim, Philippe II-251 Heisinger, Maximilian I-426 Heizmann, Matthias II-577, II-582 Hendi, Yacoub G. I-588 Hendriks, D. II-44 Henzinger, Thomas A. I-3, II-535 Herasimau, Andrei II-473 Heule, Marijn J. H. I-329, I-348, I-389 Hoenicke, Jochen II-577 Hofkamp, A. T. II-44 Hsu, Tzu-Han I-29, I-66 Huang, Xuanxiang I-167 Hussein, Soha II-553

#### **I**

Iser, Markus I-407

#### **J**

Jaber, Nouraldin II-289 Jacobs, Swen II-289 Jakobsen, Anna Blume II-353 Jansen, Nils I-508 Jongmans, Sung-Shik II-19

Jourdan, Jacques-Henri II-93 Junges, Sebastian I-469, I-508, II-410

#### **K**

Kaminski, Benjamin Lucien II-410 Karmarkar, Hrishikesh II-594 Katoen, Joost-Pieter II-391, II-410 Katz, Guy I-187, I-208, I-607 Kiesl-Reiter, Benjamin I-329, I-348 Klumpp, Dominik II-577, II-582 Kobayashi, Naoki I-227 Kokologiannakis, Michalis I-85 Konnov, Igor I-126 Korovin, Konstantin I-647 Kovács, Laura I-647 Krishna, S. I-105 Krishna, Shankara N. I-588 Kukovec, Jure I-126 Kulkarni, Milind II-289 Kullmann, Oliver II-372 Kumar, Shrawan II-588

#### **L**

Lachnitt, Hanna I-367 Lal, Akash II-433 Larsen, Casper Abild II-353 Lechner, Mathias I-3 Lee, Nian-Ze II-152 Lefaucheux, Engel I-47 Lengál, Ondřej I-249 Lester, Martin Mariusz II-173 Li, Jianwen II-36 Li, Yong I-249 Lima, Leonardo II-473 Lovett, Chris II-433 Lynce, Inês II-55

#### **M**

Malík, Viktor II-529 Mallik, Kaushik II-211 Manino, Edoardo II-541 Manquinho, Vasco II-55 Marmanis, Iason I-85 Marques-Silva, Joao I-167 Marzari, Luca I-607 Matheja, Christoph II-410 McCamant, Stephen II-553 Medicherla, Raveendra Kumar II-594 Meggendorfer, Tobias I-489


Melham, Tom I-549 Menezes, Rafael II-541 Metta, Ravindra II-594 Meyer, Roland I-628 Michaelson, Dawn I-348 Miné, Antoine II-565 Mir, Ramon Fernández II-74 Monat, Raphaël II-565 Moormann, L. II-44 Morgado, Antonio I-167

#### **N**

Nagasamudram, Ramana II-133 Naouar, Mehdi II-577 Naumann, David A. II-133 Nayak, Satya Prakash II-211 Nayyar, Fahad II-433 Nečas, František II-529

#### **O**

Osama, Muhammad I-684 Otoni, Rodrigo I-126 Ouadjaout, Abdelraouf II-565 Ouaknine, Joël I-47

## **P**

Pai, Rekha I-549 Park, Seung Hoon I-549 Pavlogiannis, Andreas II-353 Pérez, Guillermo A. I-271, II-192 Perez, Mateo I-527 Pietsch, Manuel II-547 Planes, Jordi I-167 Podelski, Andreas II-577, II-582 Pu, Geguang II-36 Purser, David I-47

#### **Q**

Quatmann, Tim I-469

#### **R**

Raskin, Jean-François II-309 Raszyk, Martin II-473 Reeves, Joseph E. I-329 Reger, Giles I-647 Reijnen, F. F. H. II-44

Reniers, M. A. II-44 Román-Díez, Guillermo I-448 Rooda, J. E. II-44 Rubio, Albert I-448

### **S**

Saan, Simmo II-547 Samanta, Roopsha II-289 Sánchez, César I-29, I-66 Sankur, Ocan II-28, II-329 Schewe, Sven I-527 Schiffelers, R. R. H. II-44 Schindler, Tanja II-577 Schmidt, Simon Meldahl II-353 Schmuck, Anne-Kathrin II-211 Schoisswohl, Johannes I-647 Schrammel, Peter II-529 Schreiber, Dominik I-348 Schulz, Stephan II-111 Schüssele, Frank II-577, II-582 Schwarz, Michael II-547 Seidl, Helmut II-547 Seidl, Martina I-426 Senthilnathan, Aditya II-433 Sharifi, Mohammadamin I-47 Sharma, Vaibhav II-553 Sharygina, Natasha I-126 Sheinvald, Sarai I-66 Shmarov, Fedor II-541 Shukla, Ankit II-372 Šmahlíková, Barbora I-249 Somenzi, Fabio I-527 Song, Yahui I-569 Spengler, Stephan I-588 Staquet, Gaëtan I-271 Steensgaard, Jesper II-353 Strejček, Jan II-523 Su, Jie II-571 Subercaseaux, Bernardo I-389

#### **T**

Thomas, Bastien II-28 Thuijsman, S. B. II-44 Tian, Cong II-571 Tilscher, Sarah II-547 Tonetta, Stefano II-3

Traytel, Dmitriy II-473 Trivedi, Ashutosh I-527 Tuppe, Omkar I-105 Turrini, Andrea I-249

#### **V**

Vafeiadis, Viktor I-85 van Beek, D. A. II-44 van de Mortel-Fronczak, J. M. II-44 van der Sanden, L. J. II-44 van der Vegt, Marck I-508 Venkatesh, R II-588 Verbakel, J. J. II-44 Viswanathan, Mahesh I-308 Vogel, J. A. II-44 Vojdani, Vesal II-453, II-547 Vojnar, Tomáš II-529 Voronkov, Andrei I-647 Vukmirović, Petar II-111

#### **W**

Wagner, Christopher II-289 Wang, Yuning II-229 Weininger, Maximilian I-469 Whalen, Michael W. I-348, II-553 Wies, Thomas I-628 Wijs, Anton I-684

Winkler, Tobias II-391 Wojtczak, Dominik I-527 Wolff, Sebastian I-628 Wu, Minchao I-227

#### **X**

Xiao, Shengping II-36 Xing, Hengrui II-571

#### **Y**

Yan, Qiuchen II-553 Yang, Jiyu II-571 Yang, Luke I-666 Yang, Zuchao II-571 Yeduru, Prasanth II-594 Yerushalmi, Raz I-607 Yuan, Simon II-473

#### **Z**

Zhang, Chengyu II-36 Zhang, Hongce II-11 Zhang, Min I-208 Zhang, Minjian I-308 Zhang, Yueling I-208 Zhou, Ziwei I-208 Zhu, He II-229 Žikelić, Đorđe I-3